linux-security-module.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
       [not found] <20200530055953.817666-1-krisman@collabora.com>
@ 2020-05-30 17:30 ` Kees Cook
  2020-05-31  5:56   ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 9+ messages in thread
From: Kees Cook @ 2020-05-30 17:30 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski,
	Will Drewry, H . Peter Anvin, Paul Gofman, linux-security-module

On Sat, May 30, 2020 at 01:59:53AM -0400, Gabriel Krisman Bertazi wrote:
> Modern Windows applications are executing system call instructions
> directly from the application's code without going through the WinAPI.
> This breaks Wine emulation, because it doesn't have a chance to
> intercept and emulate these syscalls before they are submitted to Linux.
> 
> In addition, we cannot simply trap every system call of the application
> to userspace using PTRACE_SYSEMU, because performance would suffer,
> since our main use case is to run Windows games over Linux.  Therefore,
> we need some in-kernel filtering to decide whether the syscall was
> issued by the wine code or by the windows application.

Interesting use-case! It seems like you're in the position of needing to
invert the assumption about syscalls: before you knew everything going
through WinAPI needed emulation, and now you need to assume everything
not going through a native library needs emulation. Oof.

Is it possible to disassemble and instrument the Windows code to insert
breakpoints (or emulation calls) at all the Windows syscall points?

> [...]
> * Why not SECCOMP_MODE_FILTER?
> 
> We experimented with dynamically generating BPF filters for whitelisted
> memory regions and using SECCOMP_MODE_FILTER, but there are a few
> reasons why it isn't enough nor a good idea for our use case:
> 
> 1. We cannot set the filters at program initialization time and forget
> about it, since there is no way of knowing which modules will be loaded,
> whether native and windows.  Filter would need a way to be updated
> frequently during game execution.
> 
> 2. We cannot predict which Linux libraries will issue syscalls directly.
> Most of the time, whitelisting libc and a few other libraries is enough,
> but there are no guarantees other Linux libraries won't issue syscalls
> directly and break the execution.  Adding every linux library that is
> loaded also has a large performance cost due to the large resulting
> filter.

Just so I can understand the expected use: given the dynamic nature of
the library loading, how would Wine be marking the VMAs?

> 3. As I mentioned before, performance is critical.  In our testing with
> just a single memory segment blacklisted/whitelisted, the minimum size
> of a bpf filter would be 4 instructions.  In that scenario,
> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
> time of sysinfo(2) in comparison to seccomp disabled, while the impact
> of SECCOMP_MODE_MEMMAP was averaged around 1.5%.

Was the BPF JIT enabled? I was recently examining filter performance too:
https://lore.kernel.org/linux-security-module/202005291043.A63D910A8@keescook/

> Indeed, points 1 and 2 could be worked around with some userspace work
> and improved SECCOMP_MODE_FILTER support, but at a high performance and
> some stability cost, to obtain the semantics we want.  Still, the
> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
> enough that I believe it should be considered as an upstream solution.

It looks like you're using SECCOMP_RET_TRAP for this? Signal handling
can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF?

> Sending as an RFC for now to get the discussion started.  In particular:
> 
> 1) Would this design be acceptable?

Maybe? I want to spend a little time exploring alternatives first. :)

> 2) I'm not comfortable with using the high part of vm_flags, though I'm
> not sure where else it would fit.  Suggestions?

I don't know this area very well. Hopefully some of the mm folks can
comment on it. It seems like there needs to be a more dynamic way mark
VMAs for arbitrary purposes (since this isn't _actually_ a PROT* flag,
it's just a marking for seccomp to check).

Regardless, here's a review of the specifics of patch, without regard to
the design question above. :)

> 
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Will Drewry <wad@chromium.org>
> CC: H. Peter Anvin <hpa@zytor.com>
> CC: Paul Gofman <gofmanp@gmail.com>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> ---
>  arch/Kconfig                           | 21 +++++++++
>  arch/x86/Kconfig                       |  1 +
>  include/linux/mm.h                     |  8 ++++
>  include/linux/mman.h                   |  4 +-
>  include/uapi/asm-generic/mman-common.h |  2 +
>  include/uapi/linux/seccomp.h           |  2 +
>  kernel/seccomp.c                       | 59 ++++++++++++++++++++++++++
>  mm/mprotect.c                          |  2 +-
>  8 files changed, 97 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 786a85d4ad40..e5c10231c9c2 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -471,6 +471,27 @@ config SECCOMP_FILTER
>  
>  	  See Documentation/userspace-api/seccomp_filter.rst for details.
>  
> +config HAVE_ARCH_SECCOMP_MEMMAP
> +	bool
> +	help
> +	  An arch should select this symbol if it provides all of these things:
> +	  - syscall_get_arch()
> +	  - syscall_get_arguments()
> +	  - syscall_rollback()
> +	  - syscall_set_return_value()
> +	  - SIGSYS siginfo_t support
> +	  - secure_computing is called from a ptrace_event()-safe context
> +	  - secure_computing return value is checked and a return value of -1
> +	    results in the system call being skipped immediately.
> +	  - seccomp syscall wired up
> +
> +config SECCOMP_MEMMAP
> +	def_bool y
> +	depends on HAVE_ARCH_SECCOMP_MEMMAP && SECCOMP
> +	help
> +	  Enable tasks to trap syscalls based on the page that triggered
> +	  the syscall.
> +

Since I don't see anything distinct in the requirements for
SECCOMP_MEMMAP, so I don't think it needs a separate CONFIG: it is almost
entirely built on the SECCOMP_FILTER infrastructure. I'd just drop the
CONFIGs.

>  config HAVE_ARCH_STACKLEAK
>  	bool
>  	help
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2d3f963fd6f1..70998b01c533 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -145,6 +145,7 @@ config X86
>  	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
>  	select HAVE_ARCH_PREL32_RELOCATIONS
>  	select HAVE_ARCH_SECCOMP_FILTER
> +	select HAVE_ARCH_SECCOMP_MEMMAP

(and it's not arch-specific)

>  	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>  	select HAVE_ARCH_STACKLEAK
>  	select HAVE_ARCH_TRACEHOOK
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a323422d783..6271fb7fd5bc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -294,11 +294,13 @@ extern unsigned int kobjsize(const void *objp);
>  #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
> +#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
>  #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
>  #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
>  #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
>  #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
>  #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
> +#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
>  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
>  
>  #ifdef CONFIG_ARCH_HAS_PKEYS
> @@ -340,6 +342,12 @@ extern unsigned int kobjsize(const void *objp);
>  # define VM_GROWSUP	VM_NONE
>  #endif
>  
> +#if defined(CONFIG_X86)
> +# define VM_NOSYSCALL	VM_HIGH_ARCH_5
> +#else
> +# define VM_NOSYSCALL	VM_NONE
> +#endif
> +
>  /* Bits set in the VMA until the stack is in its final location */
>  #define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
>  

I have no idea what the correct way to mark VMAs would be. As you
mentioned, this seems ... wrong. :)

> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 4b08e9c9c538..a5ca42eb685a 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -94,7 +94,8 @@ static inline void vm_unacct_memory(long pages)
>   */
>  static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
>  {
> -	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
> +	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM
> +			 | PROT_NOSYSCALL)) == 0;
>  }
>  #define arch_validate_prot arch_validate_prot
>  #endif
> @@ -119,6 +120,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
>  	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
>  	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
>  	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
> +	       _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) |
>  	       arch_calc_vm_prot_bits(prot, pkey);
>  }
>  
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index f94f65d429be..4eafbaa3f4bc 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -13,6 +13,8 @@
>  #define PROT_SEM	0x8		/* page may be used for atomic ops */
>  /*			0x10		   reserved for arch-specific use */
>  /*			0x20		   reserved for arch-specific use */
> +#define PROT_NOSYSCALL	0x40		/* page cannot trigger syscalls */
> +
>  #define PROT_NONE	0x0		/* page can not be accessed */
>  #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
>  #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */

AIUI, all of the above is to plumb the VMA marking through an mprotect()
call. I wonder if perhaps madvise() would be better? I'm not sure how
tight we are on new flags there, but I think it would be cleaner to use
that interface. Take a look at MADV_WIPEONFORK / VM_WIPEONFORK.

> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index c1735455bc53..c7d042a409e7 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -10,12 +10,14 @@
>  #define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
>  #define SECCOMP_MODE_STRICT	1 /* uses hard-coded filter. */
>  #define SECCOMP_MODE_FILTER	2 /* uses user-supplied filter. */
> +#define SECCOMP_MODE_MEMMAP	3 /* Lock syscalls per memory region. */

Making this incompatible with FILTER might cause problems for the future
(more and more process launchers are starting to set filters).

So this would, perhaps, be a new flag for struct seccomp, rather than a
new operating mode.

>  
>  /* Valid operations for seccomp syscall. */
>  #define SECCOMP_SET_MODE_STRICT		0
>  #define SECCOMP_SET_MODE_FILTER		1
>  #define SECCOMP_GET_ACTION_AVAIL	2
>  #define SECCOMP_GET_NOTIF_SIZES		3
> +#define SECCOMP_SET_MODE_MEMMAP		4
>  
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>  #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 55a6184f5990..ebf09b02db8d 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -930,6 +930,55 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>  }
>  #endif
>  
> +#ifdef CONFIG_SECCOMP_MEMMAP
> +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
> +{
> +	struct vm_area_struct *vma = find_vma(current->mm,
> +					      sd->instruction_pointer);
> +	if (!vma)
> +		BUG();

No new kernel code should use BUG:
https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on

I would maybe pr_warn_once(), but then treat it as if it was marked with
VM_NOSYSCALL.

> +
> +	if (!(vma->vm_flags & VM_NOSYSCALL))
> +		return 0;
> +
> +	syscall_rollback(current, task_pt_regs(current));
> +	seccomp_send_sigsys(this_syscall, 0);
> +
> +	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
> +
> +	return -1;
> +}

This really just looks like an ip_address filter, but I get what you
mean about stacking filters, etc. This may finally be the day we turn to
eBPF in seccomp, since that would give you access to using a map lookup
on ip_address, and the map could be updated externally (removing the
need for the madvise() changes).

> +
> +static long seccomp_set_mode_memmap(unsigned int flags)
> +{
> +	const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP;
> +	long ret = 0;
> +
> +	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> +		return -EINVAL;

This should just test for any bits in flags. TSYNC + memmap should be
avoided, IMO.

> +
> +	spin_lock_irq(&current->sighand->siglock);
> +
> +	if (seccomp_may_assign_mode(seccomp_mode))
> +		seccomp_assign_mode(current, seccomp_mode, flags);
> +	else
> +		ret = -EINVAL;
> +
> +	spin_unlock_irq(&current->sighand->siglock);
> +
> +	return ret;
> +}
> +#else
> +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
> +{
> +	BUG();
> +}
> +static long seccomp_set_mode_memmap(unsigned int flags)
> +{
> +	return -EINVAL;
> +}
> +#endif /* CONFIG_SECCOMP_MEMMAP */
> +
>  int __secure_computing(const struct seccomp_data *sd)
>  {
>  	int mode = current->seccomp.mode;
> @@ -948,6 +997,8 @@ int __secure_computing(const struct seccomp_data *sd)
>  		return 0;
>  	case SECCOMP_MODE_FILTER:
>  		return __seccomp_filter(this_syscall, sd, false);
> +	case SECCOMP_MODE_MEMMAP:
> +		return __seccomp_memmap(this_syscall, sd);
>  	default:
>  		BUG();
>  	}
> @@ -1425,6 +1476,10 @@ static long do_seccomp(unsigned int op, unsigned int flags,
>  			return -EINVAL;
>  
>  		return seccomp_get_notif_sizes(uargs);
> +	case SECCOMP_SET_MODE_MEMMAP:
> +		if (uargs != NULL)
> +			return -EINVAL;
> +		return seccomp_set_mode_memmap(flags);
>  	default:
>  		return -EINVAL;
>  	}
> @@ -1462,6 +1517,10 @@ long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter)
>  		op = SECCOMP_SET_MODE_FILTER;
>  		uargs = filter;
>  		break;
> +	case SECCOMP_MODE_MEMMAP:
> +		op = SECCOMP_SET_MODE_MEMMAP;
> +		uargs = NULL;
> +		break;

The prctl() interface is deprecated, so no new features should be added
there.

>  	default:
>  		return -EINVAL;
>  	}
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 494192ca954b..6b5c00e8bbdc 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -591,7 +591,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
>  		 * cleared from the VMA.
>  		 */
>  		mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC |
> -					VM_FLAGS_CLEAR;
> +			VM_NOSYSCALL | VM_FLAGS_CLEAR;
>  
>  		new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
>  		newflags = calc_vm_prot_bits(prot, new_vma_pkey);
> -- 
> 2.27.0.rc2
> 

So, yes, interesting. I'll continue to think about how this could work
best. Can you share what your filters looked like when you were
originally trying those?

Thanks!

-Kees

-- 
Kees Cook

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
  2020-05-30 17:30 ` [PATCH RFC] seccomp: Implement syscall isolation based on memory areas Kees Cook
@ 2020-05-31  5:56   ` Gabriel Krisman Bertazi
  2020-05-31 12:39     ` Paul Gofman
  0 siblings, 1 reply; 9+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-05-31  5:56 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski,
	Will Drewry, H . Peter Anvin, Paul Gofman, linux-security-module

Kees Cook <keescook@chromium.org> writes:

> On Sat, May 30, 2020 at 01:59:53AM -0400, Gabriel Krisman Bertazi wrote:
>> Modern Windows applications are executing system call instructions
>> directly from the application's code without going through the WinAPI.
>> This breaks Wine emulation, because it doesn't have a chance to
>> intercept and emulate these syscalls before they are submitted to Linux.
>> 
>> In addition, we cannot simply trap every system call of the application
>> to userspace using PTRACE_SYSEMU, because performance would suffer,
>> since our main use case is to run Windows games over Linux.  Therefore,
>> we need some in-kernel filtering to decide whether the syscall was
>> issued by the wine code or by the windows application.
>
> Interesting use-case! It seems like you're in the position of needing to
> invert the assumption about syscalls: before you knew everything going
> through WinAPI needed emulation, and now you need to assume everything
> not going through a native library needs emulation. Oof.
>
> Is it possible to disassemble and instrument the Windows code to insert
> breakpoints (or emulation calls) at all the Windows syscall points?

Hi Kees,

I considered instrumenting the syscall instructions with calls to some
wrapper, but I was told that modifying the game in memory or on disk
will trigger all sorts of anti-cheating mechanisms (my main use case are
windows games).

>> [...]
>> * Why not SECCOMP_MODE_FILTER?
>> 
>> We experimented with dynamically generating BPF filters for whitelisted
>> memory regions and using SECCOMP_MODE_FILTER, but there are a few
>> reasons why it isn't enough nor a good idea for our use case:
>> 
>> 1. We cannot set the filters at program initialization time and forget
>> about it, since there is no way of knowing which modules will be loaded,
>> whether native and windows.  Filter would need a way to be updated
>> frequently during game execution.
>> 
>> 2. We cannot predict which Linux libraries will issue syscalls directly.
>> Most of the time, whitelisting libc and a few other libraries is enough,
>> but there are no guarantees other Linux libraries won't issue syscalls
>> directly and break the execution.  Adding every linux library that is
>> loaded also has a large performance cost due to the large resulting
>> filter.
>
> Just so I can understand the expected use: given the dynamic nature of
> the library loading, how would Wine be marking the VMAs?

Paul (cc'ed) is the wine expert, but my understanding is that memory
allocation and initial program load of the emulated binary will go
through wine.  It does the allocation and mark the vma accordingly
before returning the allocated range to the windows application.

>> 3. As I mentioned before, performance is critical.  In our testing with
>> just a single memory segment blacklisted/whitelisted, the minimum size
>> of a bpf filter would be 4 instructions.  In that scenario,
>> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
>> time of sysinfo(2) in comparison to seccomp disabled, while the impact
>> of SECCOMP_MODE_MEMMAP was averaged around 1.5%.
>
> Was the BPF JIT enabled? I was recently examining filter performance too:
> https://lore.kernel.org/linux-security-module/202005291043.A63D910A8@keescook/

yes:

root@dilma:~# sysctl -a | grep -i jit
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0

>
>> Indeed, points 1 and 2 could be worked around with some userspace work
>> and improved SECCOMP_MODE_FILTER support, but at a high performance and
>> some stability cost, to obtain the semantics we want.  Still, the
>> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
>> enough that I believe it should be considered as an upstream solution.
>
> It looks like you're using SECCOMP_RET_TRAP for this? Signal handling
> can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF?

I experimented with SECCOMP_RET_TRAP and SECCOMP_RET_TRACE, but I hadn't
consider USER_NOTIF.  It seems to be a quite recent feature and I wasn't
aware of it.  I will try it and let you know.

>> diff --git a/include/linux/mman.h b/include/linux/mman.h
>> index 4b08e9c9c538..a5ca42eb685a 100644
>> --- a/include/linux/mman.h
>> +++ b/include/linux/mman.h
>> @@ -94,7 +94,8 @@ static inline void vm_unacct_memory(long pages)
>>   */
>>  static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
>>  {
>> -	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
>> +	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM
>> +			 | PROT_NOSYSCALL)) == 0;
>>  }
>>  #define arch_validate_prot arch_validate_prot
>>  #endif
>> @@ -119,6 +120,7 @@ calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
>>  	return _calc_vm_trans(prot, PROT_READ,  VM_READ ) |
>>  	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
>>  	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
>> +	       _calc_vm_trans(prot, PROT_NOSYSCALL, VM_NOSYSCALL) |
>>  	       arch_calc_vm_prot_bits(prot, pkey);
>>  }
>>  
>> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
>> index f94f65d429be..4eafbaa3f4bc 100644
>> --- a/include/uapi/asm-generic/mman-common.h
>> +++ b/include/uapi/asm-generic/mman-common.h
>> @@ -13,6 +13,8 @@
>>  #define PROT_SEM	0x8		/* page may be used for atomic ops */
>>  /*			0x10		   reserved for arch-specific use */
>>  /*			0x20		   reserved for arch-specific use */
>> +#define PROT_NOSYSCALL	0x40		/* page cannot trigger syscalls */
>> +
>>  #define PROT_NONE	0x0		/* page can not be accessed */
>>  #define PROT_GROWSDOWN	0x01000000	/* mprotect flag: extend change to start of growsdown vma */
>>  #define PROT_GROWSUP	0x02000000	/* mprotect flag: extend change to end of growsup vma */
>
> AIUI, all of the above is to plumb the VMA marking through an mprotect()
> call. I wonder if perhaps madvise() would be better? I'm not sure how
> tight we are on new flags there, but I think it would be cleaner to use
> that interface. Take a look at MADV_WIPEONFORK / VM_WIPEONFORK.

Right, will do.  I considered it, but wasn't sure if madvise semantics
gave any guarantees of the result, but is seems to be the case for at
least for WIPEONFORK.  it is important to have it configurable on mmap
too, to avoid the extra syscall.  Plumbing through the mmap flags
argument shouldn't be a problem.

>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>> index c1735455bc53..c7d042a409e7 100644
>> --- a/include/uapi/linux/seccomp.h
>> +++ b/include/uapi/linux/seccomp.h
>> @@ -10,12 +10,14 @@
>>  #define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
>>  #define SECCOMP_MODE_STRICT	1 /* uses hard-coded filter. */
>>  #define SECCOMP_MODE_FILTER	2 /* uses user-supplied filter. */
>> +#define SECCOMP_MODE_MEMMAP	3 /* Lock syscalls per memory region. */
>
> Making this incompatible with FILTER might cause problems for the future
> (more and more process launchers are starting to set filters).
>
> So this would, perhaps, be a new flag for struct seccomp, rather than a
> new operating mode.

Yes, agreed.  I guess it would be a real scenario, even for us, to have
isolation against malicious syscalls (MODE_FILTER) combined
with region protection (MEMAP) for emulation.

>
>>  
>>  /* Valid operations for seccomp syscall. */
>>  #define SECCOMP_SET_MODE_STRICT		0
>>  #define SECCOMP_SET_MODE_FILTER		1
>>  #define SECCOMP_GET_ACTION_AVAIL	2
>>  #define SECCOMP_GET_NOTIF_SIZES		3
>> +#define SECCOMP_SET_MODE_MEMMAP		4
>>  
>>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
>>  #define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 55a6184f5990..ebf09b02db8d 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -930,6 +930,55 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>>  }
>>  #endif
>>  
>> +#ifdef CONFIG_SECCOMP_MEMMAP
>> +static int __seccomp_memmap(int this_syscall, const struct seccomp_data *sd)
>> +{
>> +	struct vm_area_struct *vma = find_vma(current->mm,
>> +					      sd->instruction_pointer);
>> +	if (!vma)
>> +		BUG();
>
> No new kernel code should use BUG:
> https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on
>
> I would maybe pr_warn_once(), but then treat it as if it was marked with
> VM_NOSYSCALL.
>
>> +
>> +	if (!(vma->vm_flags & VM_NOSYSCALL))
>> +		return 0;
>> +
>> +	syscall_rollback(current, task_pt_regs(current));
>> +	seccomp_send_sigsys(this_syscall, 0);
>> +
>> +	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
>> +
>> +	return -1;
>> +}
>
> This really just looks like an ip_address filter, but I get what you
> mean about stacking filters, etc. This may finally be the day we turn to
> eBPF in seccomp, since that would give you access to using a map lookup
> on ip_address, and the map could be updated externally (removing the
> need for the madvise() changes).

And 64-bit comparisons :)

that would be a good solution, we'd still need to look at some details,
like disabling/updating filters at runtime when some new library is
loaded, but since we can update externally, I think it covers it.

From the wine side, it should be enough to get the semantics we want,
but I don't know about performance.

>> +
>> +static long seccomp_set_mode_memmap(unsigned int flags)
>> +{
>> +	const unsigned long seccomp_mode = SECCOMP_MODE_MEMMAP;
>> +	long ret = 0;
>> +
>> +	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>> +		return -EINVAL;
>
> This should just test for any bits in flags. TSYNC + memmap should be
> avoided, IMO.

Maybe it is worth to expose SECCOMP_FILTER_FLAG_LOG?

>>  	default:
>>  		return -EINVAL;
>>  	}
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 494192ca954b..6b5c00e8bbdc 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -591,7 +591,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
>>  		 * cleared from the VMA.
>>  		 */
>>  		mask_off_old_flags = VM_READ | VM_WRITE | VM_EXEC |
>> -					VM_FLAGS_CLEAR;
>> +			VM_NOSYSCALL | VM_FLAGS_CLEAR;
>>  
>>  		new_vma_pkey = arch_override_mprotect_pkey(vma, prot, pkey);
>>  		newflags = calc_vm_prot_bits(prot, new_vma_pkey);
>> -- 
>> 2.27.0.rc2
>> 
>
> So, yes, interesting. I'll continue to think about how this could work
> best. Can you share what your filters looked like when you were
> originally trying those?

The results I shared in the previous email had the following filter:

  BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
  BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 0xf000, 0, 1),
  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
  BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

Which is much much simpler than the real thing.  I also had a small
filter generator that generated segments like the following ( ip_HW and
ip_LW are the high and low part of the instruction pointer):

  0007: BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip_HW)
  0008: BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, EH, 0, REL_ADDR(next))
  0009: BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ip_LW)
  000a: BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, EL, REL_ADDR(next), 0)
  000b: BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, SL, REL_ADDR(filter->jump_allow_target), REL_ADDR(next))
  000c: next:

Thanks,

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
  2020-05-31  5:56   ` Gabriel Krisman Bertazi
@ 2020-05-31 12:39     ` Paul Gofman
  2020-05-31 16:49       ` Matthew Wilcox
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Gofman @ 2020-05-31 12:39 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, Kees Cook
  Cc: linux-mm, linux-kernel, kernel, Thomas Gleixner, Andy Lutomirski,
	Will Drewry, H . Peter Anvin, linux-security-module,
	Zebediah Figura

Hi!

thanks for looking into this.

On 5/31/20 08:56, Gabriel Krisman Bertazi wrote:
>
>> Is it possible to disassemble and instrument the Windows code to insert
>> breakpoints (or emulation calls) at all the Windows syscall points?
> Hi Kees,
>
> I considered instrumenting the syscall instructions with calls to some
> wrapper, but I was told that modifying the game in memory or on disk
> will trigger all sorts of anti-cheating mechanisms (my main use case are
> windows games).

Yes, this is the case. Besides, before instrumenting, we would need some
way to find those syscalls in the highly obfuscated dynamically
generated code, the whole purpose of which is to prevent disassembling,
debugging and finding things like that in it.

Ultimately, even for the cases when it would be technically feasible I
don't think Wine could ever go for modifying the program's code (unless
of course this is the part of what the program is doing itself using
winapi calls). Wine is trying to implement the API as close to Windows
as possible so the DRM can work with Wine, modifying the program's code
is something different.

>
>>> [...]
>>> * Why not SECCOMP_MODE_FILTER?
>>>
>>> We experimented with dynamically generating BPF filters for whitelisted
>>> memory regions and using SECCOMP_MODE_FILTER, but there are a few
>>> reasons why it isn't enough nor a good idea for our use case:
>>>
>>> 1. We cannot set the filters at program initialization time and forget
>>> about it, since there is no way of knowing which modules will be loaded,
>>> whether native and windows.  Filter would need a way to be updated
>>> frequently during game execution.
>>>
>>> 2. We cannot predict which Linux libraries will issue syscalls directly.
>>> Most of the time, whitelisting libc and a few other libraries is enough,
>>> but there are no guarantees other Linux libraries won't issue syscalls
>>> directly and break the execution.  Adding every linux library that is
>>> loaded also has a large performance cost due to the large resulting
>>> filter.
>> Just so I can understand the expected use: given the dynamic nature of
>> the library loading, how would Wine be marking the VMAs?
> Paul (cc'ed) is the wine expert, but my understanding is that memory
> allocation and initial program load of the emulated binary will go
> through wine.  It does the allocation and mark the vma accordingly
> before returning the allocated range to the windows application.
Yes, exactly. Pretty much any memory allocation which Wine does needs
syscalls (if those are ever encountered later during executing code from
those areas) to be trapped by Wine and passed to Wine's implementation
of the corresponding Windows API function. Linux native libraries
loading and memory allocations performed by them go outside of Wine control.
>
>>> Indeed, points 1 and 2 could be worked around with some userspace work
>>> and improved SECCOMP_MODE_FILTER support, but at a high performance and
>>> some stability cost, to obtain the semantics we want.  Still, the
>>> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
>>> enough that I believe it should be considered as an upstream solution.
>> It looks like you're using SECCOMP_RET_TRAP for this? Signal handling
>> can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF?

We are not much concerned with the overhead of the trapped syscall in
our use case, those are very rare. What we are concerned with is the
performance impact on the normal Linux syscalls when the syscall
trapping is enabled. When I was measuring that impact I've got the same
10% overhead for the untrapped syscalls (that is, always hitting
SECCOMP_RET_ALLOW case) with the filters Gabriel mentioned.


>>
>>> +
>>> +	if (!(vma->vm_flags & VM_NOSYSCALL))
>>> +		return 0;
>>> +
>>> +	syscall_rollback(current, task_pt_regs(current));
>>> +	seccomp_send_sigsys(this_syscall, 0);
>>> +
>>> +	seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
>>> +
>>> +	return -1;
>>> +}
>> This really just looks like an ip_address filter, but I get what you
>> mean about stacking filters, etc. This may finally be the day we turn to
>> eBPF in seccomp, since that would give you access to using a map lookup
>> on ip_address, and the map could be updated externally (removing the
>> need for the madvise() changes).
> And 64-bit comparisons :)
>
> that would be a good solution, we'd still need to look at some details,
> like disabling/updating filters at runtime when some new library is
> loaded, but since we can update externally, I think it covers it.
I am afraid that for a general case the filter add / update / removal
should be done not just when a new library is loaded or unloaded but
pretty much any time new (executable) memory region is allocated or
deallocated, or PROT_EXEC should be changed on allocated pages . But I
am not sure I've got enough details yet on the suggested approach here
and might be missing important details. I guess maybe we could discuss
the details separately with Gabriel first.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
  2020-05-31 12:39     ` Paul Gofman
@ 2020-05-31 16:49       ` Matthew Wilcox
  2020-05-31 17:10         ` Paul Gofman
  2020-06-01 17:53         ` Gabriel Krisman Bertazi
  0 siblings, 2 replies; 9+ messages in thread
From: Matthew Wilcox @ 2020-05-31 16:49 UTC (permalink / raw)
  To: Paul Gofman
  Cc: Gabriel Krisman Bertazi, Kees Cook, linux-mm, linux-kernel,
	kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry,
	H . Peter Anvin, linux-security-module, Zebediah Figura

On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote:
> > Paul (cc'ed) is the wine expert, but my understanding is that memory
> > allocation and initial program load of the emulated binary will go
> > through wine.  It does the allocation and mark the vma accordingly
> > before returning the allocated range to the windows application.
> Yes, exactly. Pretty much any memory allocation which Wine does needs
> syscalls (if those are ever encountered later during executing code from
> those areas) to be trapped by Wine and passed to Wine's implementation
> of the corresponding Windows API function. Linux native libraries
> loading and memory allocations performed by them go outside of Wine control.

I don't like Gabriel's approach very much.  Could we do something like
issue a syscall before executing a Windows region and then issue another
syscall when exiting?  If so, we could switch the syscall entry point (ie
change MSR_LSTAR).  I'm thinking something like a personality() syscall.
But maybe that would be too high an overhead.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
  2020-05-31 16:49       ` Matthew Wilcox
@ 2020-05-31 17:10         ` Paul Gofman
  2020-05-31 17:31           ` Matthew Wilcox
  2020-06-01 17:53         ` Gabriel Krisman Bertazi
  1 sibling, 1 reply; 9+ messages in thread
From: Paul Gofman @ 2020-05-31 17:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gabriel Krisman Bertazi, Kees Cook, linux-mm, linux-kernel,
	kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry,
	H . Peter Anvin, linux-security-module, Zebediah Figura

On 5/31/20 19:49, Matthew Wilcox wrote:
> On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote:
>>> Paul (cc'ed) is the wine expert, but my understanding is that memory
>>> allocation and initial program load of the emulated binary will go
>>> through wine.  It does the allocation and mark the vma accordingly
>>> before returning the allocated range to the windows application.
>> Yes, exactly. Pretty much any memory allocation which Wine does needs
>> syscalls (if those are ever encountered later during executing code from
>> those areas) to be trapped by Wine and passed to Wine's implementation
>> of the corresponding Windows API function. Linux native libraries
>> loading and memory allocations performed by them go outside of Wine control.
> I don't like Gabriel's approach very much.  Could we do something like
> issue a syscall before executing a Windows region and then issue another
> syscall when exiting?  If so, we could switch the syscall entry point (ie
> change MSR_LSTAR).  I'm thinking something like a personality() syscall.
> But maybe that would be too high an overhead.
>
IIRC Gabriel had such idea that we discussed. We can potentially track
the boundary between the Windows and native code exectution. But issuing
syscall every time we cross that boundary may have a prohibitive
performance impact, that happens way too often. What we could do is to
put the flag somewhere, but that flag has to be per thread. E. g., we
could use Linux gs: based thread local storage, or fs: based address
(that's what Windows using for thread local data and thus Wine maintains
also). If Seccomp filters could access such a memory location (fetch a
byte from there and put into the structure accessible by BPF_LD) we
could use SECCOMP_MODE_FILTER, I think.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
  2020-05-31 17:10         ` Paul Gofman
@ 2020-05-31 17:31           ` Matthew Wilcox
  2020-05-31 18:01             ` Paul Gofman
  0 siblings, 1 reply; 9+ messages in thread
From: Matthew Wilcox @ 2020-05-31 17:31 UTC (permalink / raw)
  To: Paul Gofman
  Cc: Gabriel Krisman Bertazi, Kees Cook, linux-mm, linux-kernel,
	kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry,
	H . Peter Anvin, linux-security-module, Zebediah Figura

On Sun, May 31, 2020 at 08:10:18PM +0300, Paul Gofman wrote:
> On 5/31/20 19:49, Matthew Wilcox wrote:
> > On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote:
> >>> Paul (cc'ed) is the wine expert, but my understanding is that memory
> >>> allocation and initial program load of the emulated binary will go
> >>> through wine.  It does the allocation and mark the vma accordingly
> >>> before returning the allocated range to the windows application.
> >> Yes, exactly. Pretty much any memory allocation which Wine does needs
> >> syscalls (if those are ever encountered later during executing code from
> >> those areas) to be trapped by Wine and passed to Wine's implementation
> >> of the corresponding Windows API function. Linux native libraries
> >> loading and memory allocations performed by them go outside of Wine control.
> > I don't like Gabriel's approach very much.  Could we do something like
> > issue a syscall before executing a Windows region and then issue another
> > syscall when exiting?  If so, we could switch the syscall entry point (ie
> > change MSR_LSTAR).  I'm thinking something like a personality() syscall.
> > But maybe that would be too high an overhead.
>
> IIRC Gabriel had such idea that we discussed. We can potentially track
> the boundary between the Windows and native code exectution. But issuing
> syscall every time we cross that boundary may have a prohibitive
> performance impact, that happens way too often. What we could do is to
> put the flag somewhere, but that flag has to be per thread. E. g., we
> could use Linux gs: based thread local storage, or fs: based address
> (that's what Windows using for thread local data and thus Wine maintains
> also). If Seccomp filters could access such a memory location (fetch a
> byte from there and put into the structure accessible by BPF_LD) we
> could use SECCOMP_MODE_FILTER, I think.

If it's the cost of the syscall that's the problem, there are ways
around that.  We'd still want a personality() call to indicate that
the syscall handler should look (somewhere) to determine the current
personality, but that could be issued at the start of execution rather
than when we switch between Windows & Linux code.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
  2020-05-31 17:31           ` Matthew Wilcox
@ 2020-05-31 18:01             ` Paul Gofman
  2020-06-01 17:54               ` Gabriel Krisman Bertazi
  0 siblings, 1 reply; 9+ messages in thread
From: Paul Gofman @ 2020-05-31 18:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gabriel Krisman Bertazi, Kees Cook, linux-mm, linux-kernel,
	kernel, Thomas Gleixner, Andy Lutomirski, Will Drewry,
	H . Peter Anvin, linux-security-module, Zebediah Figura

On 5/31/20 20:31, Matthew Wilcox wrote:
> If it's the cost of the syscall that's the problem, there are ways
> around that.  We'd still want a personality() call to indicate that
> the syscall handler should look (somewhere) to determine the current
> personality, but that could be issued at the start of execution rather
> than when we switch between Windows & Linux code.

Sure, we can call personality() at start and specify the location to
look at, the only thing is that the location should be thread specific,
that is, based on fs: or gs: or whatever else which would allow us to
have different threads in different "personality" state. If anything
needs to be set up at thread start we can do that also of course.

If there will be any proof of concept solution I will be happy to make a
proof of concept Wine patch using that and do some testing.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
  2020-05-31 16:49       ` Matthew Wilcox
  2020-05-31 17:10         ` Paul Gofman
@ 2020-06-01 17:53         ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 9+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-06-01 17:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Paul Gofman, Kees Cook, linux-mm, linux-kernel, kernel,
	Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin,
	linux-security-module, Zebediah Figura

Matthew Wilcox <willy@infradead.org> writes:

> On Sun, May 31, 2020 at 03:39:33PM +0300, Paul Gofman wrote:
>> > Paul (cc'ed) is the wine expert, but my understanding is that memory
>> > allocation and initial program load of the emulated binary will go
>> > through wine.  It does the allocation and mark the vma accordingly
>> > before returning the allocated range to the windows application.
>> Yes, exactly. Pretty much any memory allocation which Wine does needs
>> syscalls (if those are ever encountered later during executing code from
>> those areas) to be trapped by Wine and passed to Wine's implementation
>> of the corresponding Windows API function. Linux native libraries
>> loading and memory allocations performed by them go outside of Wine control.
>
> I don't like Gabriel's approach very much.  Could we do something like

Hi Matthew,

I don't oppose your suggestion, as Paul said, it should be enough for
us.  But could you elaborate on the problems you see in the original
approach, even if only for my own education?

> issue a syscall before executing a Windows region and then issue another
> syscall when exiting?  If so, we could switch the syscall entry point (ie
> change MSR_LSTAR).  I'm thinking something like a personality() syscall.
> But maybe that would be too high an overhead.
>

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
  2020-05-31 18:01             ` Paul Gofman
@ 2020-06-01 17:54               ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 9+ messages in thread
From: Gabriel Krisman Bertazi @ 2020-06-01 17:54 UTC (permalink / raw)
  To: Paul Gofman
  Cc: Matthew Wilcox, Kees Cook, linux-mm, linux-kernel, kernel,
	Thomas Gleixner, Andy Lutomirski, Will Drewry, H . Peter Anvin,
	linux-security-module, Zebediah Figura

Paul Gofman <gofmanp@gmail.com> writes:

> On 5/31/20 20:31, Matthew Wilcox wrote:
>> If it's the cost of the syscall that's the problem, there are ways
>> around that.  We'd still want a personality() call to indicate that
>> the syscall handler should look (somewhere) to determine the current
>> personality, but that could be issued at the start of execution rather
>> than when we switch between Windows & Linux code.
>
> Sure, we can call personality() at start and specify the location to
> look at, the only thing is that the location should be thread specific,
> that is, based on fs: or gs: or whatever else which would allow us to
> have different threads in different "personality" state. If anything
> needs to be set up at thread start we can do that also of course.
>
> If there will be any proof of concept solution I will be happy to make a
> proof of concept Wine patch using that and do some testing.

Let me give that a try and share the patches with you, so we can look at
how this implementation would look like.

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-06-01 17:54 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200530055953.817666-1-krisman@collabora.com>
2020-05-30 17:30 ` [PATCH RFC] seccomp: Implement syscall isolation based on memory areas Kees Cook
2020-05-31  5:56   ` Gabriel Krisman Bertazi
2020-05-31 12:39     ` Paul Gofman
2020-05-31 16:49       ` Matthew Wilcox
2020-05-31 17:10         ` Paul Gofman
2020-05-31 17:31           ` Matthew Wilcox
2020-05-31 18:01             ` Paul Gofman
2020-06-01 17:54               ` Gabriel Krisman Bertazi
2020-06-01 17:53         ` Gabriel Krisman Bertazi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).