All of lore.kernel.org
 help / color / mirror / Atom feed
* Detecting user data on base types
@ 2019-05-29 18:47 Andrew Murray
  2019-05-29 19:49 ` Dan Carpenter
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Murray @ 2019-05-29 18:47 UTC (permalink / raw)
  To: smatch; +Cc: dan.carpenter

Hello,

I'm investigating the use of smatch to identify user provided data in specific
kernel functions, however I'm having difficulty in making this work correctly.

ARM64 provides a feature where the top byte of a pointer is ignored by the
hardware and as a result userspace can use this byte for its own purpose. The
ABI is being changed meaning that userspace is permitted to pass pointers to
the kernel with the top byte being non-zero. These pointers get into the kernel
via a variety of ioctls and syscalls and are usually passed around as unsigned
longs. However, despite the hardware ignoring the top byte, many memory related
kernel functions don't handle this and so the top byte must be untagged/stripped
prior to entering these functions (e.g. [1]).

I'd like to make use of smatch to identify code that doesn't untag memory
addresses where needed. My initial approach was to use sparse, and to annotate
functions that required an untagged address as follows:

--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -19,6 +19,7 @@
 # define __cond_lock(x,c)      ((c) ? ({ __acquire(x); 1; }) : 0)
 # define __percpu      __attribute__((noderef, address_space(3)))
 # define __rcu         __attribute__((noderef, address_space(4)))
+# define __untagged    __attribute__((address_space(5)))
 # define __private     __attribute__((noderef))
 extern void __chk_user_ptr(const volatile void __user *);
 extern void __chk_io_ptr(const volatile void __iomem *);

--- a/evaluate.c
+++ b/evaluate.c
@@ -1470,6 +1470,14 @@ static int compatible_assignment_types(struct expression *expr, struct symbol *t
                return 0;
        }
 
+       if (target->ctype.as == 5 && source->ctype.as != 5) {
+               warning(expr->pos, "Tag mismatch in %s (%d)", where, expr->pos.line);
+               info(expr->pos, "   asn = %d", source->ctype.as);
+               info(expr->pos, "   asn = %d", target->ctype.as);
+               *rp = cast_to(*rp, target);
+               return 0;               
+       }
+
        return 1;
 }
 
This hack works and correctly identifies code that isn't using an __untagged
annotated variable when it should. (A untagged_addr macro is used to strip the
tag and add the __untagged annotation to the returned unsigned longs).  However,
it results in the propogation of __untagged annotations across the kernel up the
call chain - for example for the find_vma function: every kernel function which
calls a function which calls find_vma will have this annotation - this includes
functions that are no where near user-space (e.g. core memory management code).

I then attempted to use smatch, this seems appropiate as there are existing
check_xxx files which do similar tasks, e.g. check_spectre, check_uninitialized,
etc. My approach is to only annotate the functions that actually need the address
pointer to be untagged (rather than that, its caller, and their caller, etc) and
then rely on the smatch flow to identify if the address pointer is influenced by
userspace or not.

I first attempted this by hooking into expressions that use annotated symbols
(annotated by the function prototype arguments) and test to see if the expression
includes data provided by user. I seem to fail when testing for user provided
data - can smatch track user provided data that are base types (e.g. unsigned
long)? E.g.

void func2(int x)
{
	x++;
}

void func1(int x)
{
	x++;
	func2(x);
}

void start(void *b)
{
	void *a;
	copy_from_user(a, b, 10);
	func1(a[0]);
}

In the above, smatch seems to be able to identify the callsite for func1 as
having user data, however not for x in the x++ expressions. I guess smatch would
need to track each use of 'a' and propogate a 'userspace' state for each assignment,
e.g. to x. (Is smatch currently limited to tracking user data for pointers rather
than base types?).

Another approach could be to hook into expressions that use annotated symbols - 
but only for functions where there is a syscall or ioctl in any of the possible
call stacks - is this possible?

I'm very new to smatch/sparse so I would be grateful for any pointers (excuse the
pun) on how best to approach this. Also let me know if there is a more appropriate
place for this discussion.

FYI - An earlier discussion related to this can be found here [2].

[1] https://patchwork.kernel.org/patch/9691093/
[2] https://patchwork.kernel.org/patch/10581535/

Thanks,

Andrew Murray

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-05-29 18:47 Detecting user data on base types Andrew Murray
@ 2019-05-29 19:49 ` Dan Carpenter
  2019-05-30  9:03   ` Andrew Murray
  0 siblings, 1 reply; 13+ messages in thread
From: Dan Carpenter @ 2019-05-29 19:49 UTC (permalink / raw)
  To: Andrew Murray; +Cc: smatch

On Wed, May 29, 2019 at 07:47:23PM +0100, Andrew Murray wrote:
> void func1(int x)
> {
> 	x++;
> 	func2(x);
> }
> 
> void start(void *b)
> {
> 	void *a;
> 	copy_from_user(a, b, 10);
> 	func1(a[0]);
> }

Unfortunately, Smatch is pretty crap at tracking array elements so it
doesn't track that a[0] is user controlled.

Otherwise it generally does track that an int is controlled by the user
and the user can pick a number between 0-10 or whatever.  Like maybe the
a=0-100 but only for in-kernel API and 0-10 can be set from the ioctl,
the two ranges are tracked separately.

> 
> In the above, smatch seems to be able to identify the callsite for func1 as
> having user data, however not for x in the x++ expressions. I guess smatch would
> need to track each use of 'a' and propogate a 'userspace' state for each assignment,
> e.g. to x. (Is smatch currently limited to tracking user data for pointers rather
> than base types?).

x++ is problematic for Smatch, also.  If we're outside of a loop then
it says 0-10 becomes 1-11, but if we're inside a loop then it becomes
1-s32max.

Another problem is that Smatch doesn't understand how the sign_extend64()
function works so it doesn't understand the untagged_addr() macro.  :/
I see one bug here and a missing feature...  Right now Smatch thinks
that the return value is totally unknown and not user controlled.  This
is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop().

Anyway, let me send you some updated test code without the array.  Copy
it to the smatch/ directory and do:

./smatch --info -p=kernel test.c | tee warns.txt
./smatch_data/db/create_db.sh -p=kernel warns.txt
./smatch -p=kernel test.c

That maybe gives you a better idea of where we're at.

I guess my other question would be where would you want to print the
warning?

regards,
dan carpenter
--------------

#include "check_debug.h"

int copy_from_user(void *dest, void *src, long size);

#define __s64 signed long long
#define __u64 unsigned long long
#define u64 unsigned long long
#define __u8 unsigned char

/**
 * sign_extend64 - sign extend a 64-bit value using specified bit as sign-bit
 * @value: value to sign extend
 * @index: 0 based bit index (0<=index<64) to sign bit
 */
static inline __s64 sign_extend64(__u64 value, int index)
{
	__u8 shift = 63 - index;
	return (__s64)(value << shift) >> shift;
}

#define untagged_addr(addr) ((__typeof__(addr))sign_extend64((u64)(addr), 55))

void func1(unsigned long x)
{
	__smatch_user_rl(x);
}

void *p;
int main(void)
{
	unsigned long a;

	copy_from_user(&a, p, sizeof(a));
	__smatch_user_rl(a);
	__smatch_user_rl(untagged_addr(a));
	func1(a);
	func1(untagged_addr(a));

	return 0;
}

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-05-29 19:49 ` Dan Carpenter
@ 2019-05-30  9:03   ` Andrew Murray
  2019-05-30 17:46     ` Dan Carpenter
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Murray @ 2019-05-30  9:03 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: smatch

Hi Dan,

Thanks for the quick and helpful response.

On Wed, May 29, 2019 at 10:49:33PM +0300, Dan Carpenter wrote:
> On Wed, May 29, 2019 at 07:47:23PM +0100, Andrew Murray wrote:
> > void func1(int x)
> > {
> > 	x++;
> > 	func2(x);
> > }
> > 
> > void start(void *b)
> > {
> > 	void *a;
> > 	copy_from_user(a, b, 10);
> > 	func1(a[0]);
> > }
> 
> Unfortunately, Smatch is pretty crap at tracking array elements so it
> doesn't track that a[0] is user controlled.
> 
> Otherwise it generally does track that an int is controlled by the user
> and the user can pick a number between 0-10 or whatever.  Like maybe the
> a=0-100 but only for in-kernel API and 0-10 can be set from the ioctl,
> the two ranges are tracked separately.
> 
> > 
> > In the above, smatch seems to be able to identify the callsite for func1 as
> > having user data, however not for x in the x++ expressions. I guess smatch would
> > need to track each use of 'a' and propogate a 'userspace' state for each assignment,
> > e.g. to x. (Is smatch currently limited to tracking user data for pointers rather
> > than base types?).
> 
> x++ is problematic for Smatch, also.  If we're outside of a loop then
> it says 0-10 becomes 1-11, but if we're inside a loop then it becomes
> 1-s32max.

OK thanks for the explanation.

> 
> Another problem is that Smatch doesn't understand how the sign_extend64()
> function works so it doesn't understand the untagged_addr() macro.  :/
> I see one bug here and a missing feature...  Right now Smatch thinks
> that the return value is totally unknown and not user controlled.  This
> is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop().

Actually the ability to treat the return type as not user controlled makes
this a little simpler...

We want to flag a potential issue when a tagged address is used in a function
that wants untagged addresses (as marked by the address space annotation).
Therefore we can simply test for "ctype.as == 5 && is_user_rl(expr)" to meet
this condition.

When a user correctly calls untagged_addr prior to calling these functions
the macro helpfully drops the user controlled flag which stops the condition
above triggering.

Is there a way of getting smatch to drop the 'user controlled' flag without
relying on a bug?

> 
> Anyway, let me send you some updated test code without the array.  Copy
> it to the smatch/ directory and do:
> 
> ./smatch --info -p=kernel test.c | tee warns.txt
> ./smatch_data/db/create_db.sh -p=kernel warns.txt
> ./smatch -p=kernel test.c
> 
> That maybe gives you a better idea of where we're at.

Thanks this was really helpful, I'm not sure I really understood how the
database was created prior to this.

> 
> I guess my other question would be where would you want to print the
> warning?

Thanks to your help I believe I have this working as follows:

diff --git a/check_list.h b/check_list.h
index b1d24c504ba5..f7551e7c5215 100644
--- a/check_list.h
+++ b/check_list.h
@@ -192,6 +192,7 @@ CK(check_nospec_barrier)
 CK(check_spectre)
 CK(check_spectre_second_half)
 CK(check_implicit_dependencies)
+CK(check_tagged)
 
 /* wine specific stuff */
 CK(check_wine_filehandles)
diff --git a/check_tagged.c b/check_tagged.c
new file mode 100644
index 000000000000..d14a81a6c33a
--- /dev/null
+++ b/check_tagged.c
@@ -0,0 +1,49 @@
+#include "smatch.h"
+#include "smatch_extra.h"
+
+static void untagged_check(struct expression *expr)
+{
+       char *name;
+
+       if (parse_error)
+               return;
+
+       if (is_impossible_path())
+               return;
+
+       if (expr->type == EXPR_PREOP)
+               return;
+
+       if (is_user_rl(expr)) {
+               if (expr->symbol && expr->symbol->ctype.as == 5) {
+                       name = expr_to_str(expr);
+                       sm_warning("potential tagged issue '%s'", name);
+                       free_string(name);
+               }
+       }
+}
+
+void check_tagged(int id)
+{
+       if (option_project != PROJ_KERNEL)
+               return;
+
+       add_hook(&untagged_check, EXPR_HOOK);
+}
 
This correctly identifies functions that have been annotated with the __untagged
annotation that use data from userspace. However, it's not actually that
useful...

For example, if I annotate the find_vma function, then it will flag this as a
potential tagged issue given that its called (from at least some contexts) with
userspace data. However find_vma is called all over the kernel, and so it's
difficult to figure out which caller called find_vma with user data. 

It's possible to use ./smatch_scripts/unlocked_paths.pl with a garbage lock and
'find_vma' target to look for all the callers where this is a leaf function, but
this doesn't track the call parameters to exclude paths where the leaf annotated
function contains user data.

I haven't fully understood var_user_rl - but it looks like functions such as
match_user_assign_function propogate the user_data tag on each run to update
state, and thus it's probably not easy to get the original source of user
data after is_user_rl determines its user. Is that correct?

I would be great if there was a way to obtain the function that provided the
original source of user data - then it would be possible to write a script that
shows the path (call chain) between the two functions - I hoped smatch_scripts/
call_tree.pl could do this but it doesn't seem to do anything for me.

Thanks,

Andrew Murray

> 
> regards,
> dan carpenter
> --------------
> 
> #include "check_debug.h"
> 
> int copy_from_user(void *dest, void *src, long size);
> 
> #define __s64 signed long long
> #define __u64 unsigned long long
> #define u64 unsigned long long
> #define __u8 unsigned char
> 
> /**
>  * sign_extend64 - sign extend a 64-bit value using specified bit as sign-bit
>  * @value: value to sign extend
>  * @index: 0 based bit index (0<=index<64) to sign bit
>  */
> static inline __s64 sign_extend64(__u64 value, int index)
> {
> 	__u8 shift = 63 - index;
> 	return (__s64)(value << shift) >> shift;
> }
> 
> #define untagged_addr(addr) ((__typeof__(addr))sign_extend64((u64)(addr), 55))
> 
> void func1(unsigned long x)
> {
> 	__smatch_user_rl(x);

It took more than a glance to realise what this was doing!

> }
> 
> void *p;
> int main(void)
> {
> 	unsigned long a;
> 
> 	copy_from_user(&a, p, sizeof(a));
> 	__smatch_user_rl(a);
> 	__smatch_user_rl(untagged_addr(a));
> 	func1(a);
> 	func1(untagged_addr(a));
> 
> 	return 0;
> }
> 

Thanks,

Andrew Murray

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-05-30  9:03   ` Andrew Murray
@ 2019-05-30 17:46     ` Dan Carpenter
  2019-06-05  8:29       ` Andrew Murray
  0 siblings, 1 reply; 13+ messages in thread
From: Dan Carpenter @ 2019-05-30 17:46 UTC (permalink / raw)
  To: Andrew Murray; +Cc: smatch

On Thu, May 30, 2019 at 10:03:30AM +0100, Andrew Murray wrote:
> > Another problem is that Smatch doesn't understand how the sign_extend64()
> > function works so it doesn't understand the untagged_addr() macro.  :/
> > I see one bug here and a missing feature...  Right now Smatch thinks
> > that the return value is totally unknown and not user controlled.  This
> > is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop().
> 
> Actually the ability to treat the return type as not user controlled makes
> this a little simpler...
> 
> We want to flag a potential issue when a tagged address is used in a function
> that wants untagged addresses (as marked by the address space annotation).
> Therefore we can simply test for "ctype.as == 5 && is_user_rl(expr)" to meet
> this condition.
> 
> When a user correctly calls untagged_addr prior to calling these functions
> the macro helpfully drops the user controlled flag which stops the condition
> above triggering.
> 
> Is there a way of getting smatch to drop the 'user controlled' flag without
> relying on a bug?

We know the high byte is either 0 or 0xff so we could do something like:

	struct symbol *type;
	sval_t invalid = { .type = &ullong_ctype, .vaule = 1ULL << 62 };

	type = get_type(expr);
	if (!type || type_bits(type) != 64)
		return;

	if (!get_user_rl(expr, &rl))
		return;
	if (rl_has_sval(rl, invalid))
		sm_warning("potential tagged issue '%s'", name);

I will fix the untagged_addr() handling very soon.

> Thanks to your help I believe I have this working as follows:
> 
> diff --git a/check_list.h b/check_list.h
> index b1d24c504ba5..f7551e7c5215 100644
> --- a/check_list.h
> +++ b/check_list.h
> @@ -192,6 +192,7 @@ CK(check_nospec_barrier)
>  CK(check_spectre)
>  CK(check_spectre_second_half)
>  CK(check_implicit_dependencies)
> +CK(check_tagged)
>  
>  /* wine specific stuff */
>  CK(check_wine_filehandles)
> diff --git a/check_tagged.c b/check_tagged.c
> new file mode 100644
> index 000000000000..d14a81a6c33a
> --- /dev/null
> +++ b/check_tagged.c
> @@ -0,0 +1,49 @@
> +#include "smatch.h"
> +#include "smatch_extra.h"
> +
> +static void untagged_check(struct expression *expr)
> +{
> +       char *name;
> +
> +       if (parse_error)
> +               return;
> +
> +       if (is_impossible_path())
> +               return;
> +
> +       if (expr->type == EXPR_PREOP)
> +               return;
> +
> +       if (is_user_rl(expr)) {
> +               if (expr->symbol && expr->symbol->ctype.as == 5) {
> +                       name = expr_to_str(expr);
> +                       sm_warning("potential tagged issue '%s'", name);
> +                       free_string(name);
> +               }
> +       }
> +}
> +
> +void check_tagged(int id)
> +{
> +       if (option_project != PROJ_KERNEL)
> +               return;
> +
> +       add_hook(&untagged_check, EXPR_HOOK);
> +}
>  
> This correctly identifies functions that have been annotated with the __untagged
> annotation that use data from userspace. However, it's not actually that
> useful...
> 
> For example, if I annotate the find_vma function, then it will flag this as a
> potential tagged issue given that its called (from at least some contexts) with
> userspace data. However find_vma is called all over the kernel, and so it's
> difficult to figure out which caller called find_vma with user data.
> 
> It's possible to use ./smatch_scripts/unlocked_paths.pl with a garbage lock and
> 'find_vma' target to look for all the callers where this is a leaf function, but
> this doesn't track the call parameters to exclude paths where the leaf annotated
> function contains user data.

Huh...  I haven't thought about that script in years...  :/

What I do is I download the latest linux-next ever day and rebuild my DB
every day.  Each time you rebuild it, the call tree gets filled out
a bit.  After about a week then the DB is as complete as it is going to
get.

Then I use the smatch_data/db/smdb.py script to figure out the warnings.
I should add an option so that it only shows callers which pass user
data.  Each call site has a unique caller ID.


> I haven't fully understood var_user_rl - but it looks like functions such as
> match_user_assign_function propogate the user_data tag on each run to update
> state, and thus it's probably not easy to get the original source of user
> data after is_user_rl determines its user. Is that correct?

Use the get_user_rl() function instead.  The var_user_rl() is only
supposed to be used internally, to do math.

> 
> I would be great if there was a way to obtain the function that provided the
> original source of user data - then it would be possible to write a script that
> shows the path (call chain) between the two functions - I hoped smatch_scripts/
> call_tree.pl could do this but it doesn't seem to do anything for me.

That's the smatch_data/db/smdb.py script.  I have it as an alias in vim
so I can do CTRL-c to see how a function is called or CTRL-r to see
what it returns

# map <C-r> :! vim_smdb return_states <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
# map <C-c> :! vim_smdb <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>

regards,
dan carpenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-05-30 17:46     ` Dan Carpenter
@ 2019-06-05  8:29       ` Andrew Murray
  2019-06-05 11:47         ` Andrew Murray
                           ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Andrew Murray @ 2019-06-05  8:29 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: smatch

On Thu, May 30, 2019 at 08:46:42PM +0300, Dan Carpenter wrote:
> On Thu, May 30, 2019 at 10:03:30AM +0100, Andrew Murray wrote:
> > > Another problem is that Smatch doesn't understand how the sign_extend64()
> > > function works so it doesn't understand the untagged_addr() macro.  :/
> > > I see one bug here and a missing feature...  Right now Smatch thinks
> > > that the return value is totally unknown and not user controlled.  This
> > > is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop().
> > 
> > Actually the ability to treat the return type as not user controlled makes
> > this a little simpler...
> > 
> > We want to flag a potential issue when a tagged address is used in a function
> > that wants untagged addresses (as marked by the address space annotation).
> > Therefore we can simply test for "ctype.as == 5 && is_user_rl(expr)" to meet
> > this condition.
> > 
> > When a user correctly calls untagged_addr prior to calling these functions
> > the macro helpfully drops the user controlled flag which stops the condition
> > above triggering.
> > 
> > Is there a way of getting smatch to drop the 'user controlled' flag without
> > relying on a bug?
> 
> We know the high byte is either 0 or 0xff so we could do something like:
> 
> 	struct symbol *type;
> 	sval_t invalid = { .type = &ullong_ctype, .vaule = 1ULL << 62 };
> 
> 	type = get_type(expr);
> 	if (!type || type_bits(type) != 64)
> 		return;
> 
> 	if (!get_user_rl(expr, &rl))
> 		return;
> 	if (rl_has_sval(rl, invalid))
> 		sm_warning("potential tagged issue '%s'", name);
> 
> I will fix the untagged_addr() handling very soon.

Thanks for this.

Though actually for MTE [1] the top byte may contain any value. But the above
could easily be updated.

> 
> > Thanks to your help I believe I have this working as follows:
> > 
> > diff --git a/check_list.h b/check_list.h
> > index b1d24c504ba5..f7551e7c5215 100644
> > --- a/check_list.h
> > +++ b/check_list.h
> > @@ -192,6 +192,7 @@ CK(check_nospec_barrier)
> >  CK(check_spectre)
> >  CK(check_spectre_second_half)
> >  CK(check_implicit_dependencies)
> > +CK(check_tagged)
> >  
> >  /* wine specific stuff */
> >  CK(check_wine_filehandles)
> > diff --git a/check_tagged.c b/check_tagged.c
> > new file mode 100644
> > index 000000000000..d14a81a6c33a
> > --- /dev/null
> > +++ b/check_tagged.c
> > @@ -0,0 +1,49 @@
> > +#include "smatch.h"
> > +#include "smatch_extra.h"
> > +
> > +static void untagged_check(struct expression *expr)
> > +{
> > +       char *name;
> > +
> > +       if (parse_error)
> > +               return;
> > +
> > +       if (is_impossible_path())
> > +               return;
> > +
> > +       if (expr->type == EXPR_PREOP)
> > +               return;
> > +
> > +       if (is_user_rl(expr)) {
> > +               if (expr->symbol && expr->symbol->ctype.as == 5) {
> > +                       name = expr_to_str(expr);
> > +                       sm_warning("potential tagged issue '%s'", name);
> > +                       free_string(name);
> > +               }
> > +       }
> > +}
> > +
> > +void check_tagged(int id)
> > +{
> > +       if (option_project != PROJ_KERNEL)
> > +               return;
> > +
> > +       add_hook(&untagged_check, EXPR_HOOK);
> > +}
> >  
> > This correctly identifies functions that have been annotated with the __untagged
> > annotation that use data from userspace. However, it's not actually that
> > useful...
> > 
> > For example, if I annotate the find_vma function, then it will flag this as a
> > potential tagged issue given that its called (from at least some contexts) with
> > userspace data. However find_vma is called all over the kernel, and so it's
> > difficult to figure out which caller called find_vma with user data.
> > 
> > It's possible to use ./smatch_scripts/unlocked_paths.pl with a garbage lock and
> > 'find_vma' target to look for all the callers where this is a leaf function, but
> > this doesn't track the call parameters to exclude paths where the leaf annotated
> > function contains user data.
> 
> Huh...  I haven't thought about that script in years...  :/
> 
> What I do is I download the latest linux-next ever day and rebuild my DB
> every day.  Each time you rebuild it, the call tree gets filled out
> a bit.  After about a week then the DB is as complete as it is going to
> get.
> 
> Then I use the smatch_data/db/smdb.py script to figure out the warnings.
> I should add an option so that it only shows callers which pass user
> data.  Each call site has a unique caller ID.

Thanks - I hadn't looked at that script, but looks very useful.

By the way with the following hunk you can, for a given function, which call sites
pass user data.

@@ -614,6 +642,7 @@ elif sys.argv[1] == "call_info":
     print_caller_info(filename, func)
 elif sys.argv[1] == "user_data":
     func = sys.argv[2]
+    filename = sys.argv[3]
     print_caller_info(filename, func, "USER_DATA")
 elif sys.argv[1] == "param_value":
     func = sys.argv[2]

Though I've found it's quite useful to manually type commands in the sqlite3 CLI.

> 
> 
> > I haven't fully understood var_user_rl - but it looks like functions such as
> > match_user_assign_function propogate the user_data tag on each run to update
> > state, and thus it's probably not easy to get the original source of user
> > data after is_user_rl determines its user. Is that correct?
> 
> Use the get_user_rl() function instead.  The var_user_rl() is only
> supposed to be used internally, to do math.

OK.

> 
> > 
> > I would be great if there was a way to obtain the function that provided the
> > original source of user data - then it would be possible to write a script that
> > shows the path (call chain) between the two functions - I hoped smatch_scripts/
> > call_tree.pl could do this but it doesn't seem to do anything for me.
> 
> That's the smatch_data/db/smdb.py script.  I have it as an alias in vim
> so I can do CTRL-c to see how a function is called or CTRL-r to see
> what it returns
> 
> # map <C-r> :! vim_smdb return_states <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> # map <C-c> :! vim_smdb <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> 

I've spent some time playing with this - thanks.

Through this discussion I'm able to detect when annotated function parameters contain
user provided values. The challenge for me is to detect where that data originated
from (i.e. following the parameter up the call tree) to ease debugging.

My first attempt didn't trace the parameters and just looked at the call tree for any
functions which provided user data, however this resulted in false positives (e.g.
just because a function higher up in the call stack passed user data, it doesn't mean
it was this data that made it to the target function).

I've made some progress by adapting the trace_param feature of smdb.py - I've modified
it such that for a given symbol (e.g. find_vma), it will trace the parameters up the
call tree, if it sees a function where the param is user data (8017) then it prints the
symbol. This seems to work, but it doesn't seem to catch everything it should.

For example, find_vma is called by apply_vma_lock_flags which is called by do_mlock,
probing the database:

sqlite> select * from caller_info where function='find_vma' and (parameter = -1 or parameter = 1) and caller='apply_vma_lock_flags';         
file|caller|function|call_id|static|type|parameter|key|value
mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|0|-1||struct vm_area_struct*(*)(struct mm_struct*, ulong)
mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1004|1|$|1
mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1014|1|$|p 0
mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1001|1|$|0,4096-18446744073709547520

The third entry indicates that find_vma was called by apply_vma_lock_flags and find_vma(arg 1) == apply_vma_lock_flags(arg 0),
this is correct, moving up the call tree:

sqlite> select * from caller_info where function='apply_vma_lock_flags' and (parameter = -1 or parameter = 0) and caller='do_mlock';
file|caller|function|call_id|static|type|parameter|key|value
mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|0|-1||int(*)(ulong, ulong, ulong)
mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1004|0|$|1
mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1001|0|$|0,4096-18446744073709547520

The database doesn't know that when do_mlock calls apply_vma_lock_flags the first argument of
apply_vma_lock_flags is the first argument of do_mlock. There is no data source associated and
our tracing of params stops early. Do you have any clue why this may be?

I've rebuilt the database in a loop with ~/smatch/smatch_scripts/build_kernel_data.sh.

Thanks,

Andrew Murray

[1] https://llvm.org/devmtg/2018-10/slides/Serebryany-Stepanov-Tsyrklevich-Memory-Tagging-Slides-LLVM-2018.pdf

> regards,
> dan carpenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-06-05  8:29       ` Andrew Murray
@ 2019-06-05 11:47         ` Andrew Murray
  2019-06-05 12:28           ` Andrew Murray
  2019-06-06  9:40         ` Dan Carpenter
  2019-06-06 14:39         ` Dan Carpenter
  2 siblings, 1 reply; 13+ messages in thread
From: Andrew Murray @ 2019-06-05 11:47 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: smatch

On Wed, Jun 05, 2019 at 09:29:26AM +0100, Andrew Murray wrote:
> On Thu, May 30, 2019 at 08:46:42PM +0300, Dan Carpenter wrote:
> > On Thu, May 30, 2019 at 10:03:30AM +0100, Andrew Murray wrote:
> > > > Another problem is that Smatch doesn't understand how the sign_extend64()
> > > > function works so it doesn't understand the untagged_addr() macro.  :/
> > > > I see one bug here and a missing feature...  Right now Smatch thinks
> > > > that the return value is totally unknown and not user controlled.  This
> > > > is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop().
> > > 
> > > Actually the ability to treat the return type as not user controlled makes
> > > this a little simpler...
> > > 
> > > We want to flag a potential issue when a tagged address is used in a function
> > > that wants untagged addresses (as marked by the address space annotation).
> > > Therefore we can simply test for "ctype.as == 5 && is_user_rl(expr)" to meet
> > > this condition.
> > > 
> > > When a user correctly calls untagged_addr prior to calling these functions
> > > the macro helpfully drops the user controlled flag which stops the condition
> > > above triggering.
> > > 
> > > Is there a way of getting smatch to drop the 'user controlled' flag without
> > > relying on a bug?
> > 
> > We know the high byte is either 0 or 0xff so we could do something like:
> > 
> > 	struct symbol *type;
> > 	sval_t invalid = { .type = &ullong_ctype, .vaule = 1ULL << 62 };
> > 
> > 	type = get_type(expr);
> > 	if (!type || type_bits(type) != 64)
> > 		return;
> > 
> > 	if (!get_user_rl(expr, &rl))
> > 		return;
> > 	if (rl_has_sval(rl, invalid))
> > 		sm_warning("potential tagged issue '%s'", name);
> > 
> > I will fix the untagged_addr() handling very soon.
> 
> Thanks for this.
> 
> Though actually for MTE [1] the top byte may contain any value. But the above
> could easily be updated.
> 
> > 
> > > Thanks to your help I believe I have this working as follows:
> > > 
> > > diff --git a/check_list.h b/check_list.h
> > > index b1d24c504ba5..f7551e7c5215 100644
> > > --- a/check_list.h
> > > +++ b/check_list.h
> > > @@ -192,6 +192,7 @@ CK(check_nospec_barrier)
> > >  CK(check_spectre)
> > >  CK(check_spectre_second_half)
> > >  CK(check_implicit_dependencies)
> > > +CK(check_tagged)
> > >  
> > >  /* wine specific stuff */
> > >  CK(check_wine_filehandles)
> > > diff --git a/check_tagged.c b/check_tagged.c
> > > new file mode 100644
> > > index 000000000000..d14a81a6c33a
> > > --- /dev/null
> > > +++ b/check_tagged.c
> > > @@ -0,0 +1,49 @@
> > > +#include "smatch.h"
> > > +#include "smatch_extra.h"
> > > +
> > > +static void untagged_check(struct expression *expr)
> > > +{
> > > +       char *name;
> > > +
> > > +       if (parse_error)
> > > +               return;
> > > +
> > > +       if (is_impossible_path())
> > > +               return;
> > > +
> > > +       if (expr->type == EXPR_PREOP)
> > > +               return;
> > > +
> > > +       if (is_user_rl(expr)) {
> > > +               if (expr->symbol && expr->symbol->ctype.as == 5) {
> > > +                       name = expr_to_str(expr);
> > > +                       sm_warning("potential tagged issue '%s'", name);
> > > +                       free_string(name);
> > > +               }
> > > +       }
> > > +}
> > > +
> > > +void check_tagged(int id)
> > > +{
> > > +       if (option_project != PROJ_KERNEL)
> > > +               return;
> > > +
> > > +       add_hook(&untagged_check, EXPR_HOOK);
> > > +}
> > >  
> > > This correctly identifies functions that have been annotated with the __untagged
> > > annotation that use data from userspace. However, it's not actually that
> > > useful...
> > > 
> > > For example, if I annotate the find_vma function, then it will flag this as a
> > > potential tagged issue given that its called (from at least some contexts) with
> > > userspace data. However find_vma is called all over the kernel, and so it's
> > > difficult to figure out which caller called find_vma with user data.
> > > 
> > > It's possible to use ./smatch_scripts/unlocked_paths.pl with a garbage lock and
> > > 'find_vma' target to look for all the callers where this is a leaf function, but
> > > this doesn't track the call parameters to exclude paths where the leaf annotated
> > > function contains user data.
> > 
> > Huh...  I haven't thought about that script in years...  :/
> > 
> > What I do is I download the latest linux-next ever day and rebuild my DB
> > every day.  Each time you rebuild it, the call tree gets filled out
> > a bit.  After about a week then the DB is as complete as it is going to
> > get.
> > 
> > Then I use the smatch_data/db/smdb.py script to figure out the warnings.
> > I should add an option so that it only shows callers which pass user
> > data.  Each call site has a unique caller ID.
> 
> Thanks - I hadn't looked at that script, but looks very useful.
> 
> By the way with the following hunk you can, for a given function, which call sites
> pass user data.
> 
> @@ -614,6 +642,7 @@ elif sys.argv[1] == "call_info":
>      print_caller_info(filename, func)
>  elif sys.argv[1] == "user_data":
>      func = sys.argv[2]
> +    filename = sys.argv[3]
>      print_caller_info(filename, func, "USER_DATA")
>  elif sys.argv[1] == "param_value":
>      func = sys.argv[2]
> 
> Though I've found it's quite useful to manually type commands in the sqlite3 CLI.
> 
> > 
> > 
> > > I haven't fully understood var_user_rl - but it looks like functions such as
> > > match_user_assign_function propogate the user_data tag on each run to update
> > > state, and thus it's probably not easy to get the original source of user
> > > data after is_user_rl determines its user. Is that correct?
> > 
> > Use the get_user_rl() function instead.  The var_user_rl() is only
> > supposed to be used internally, to do math.
> 
> OK.
> 
> > 
> > > 
> > > I would be great if there was a way to obtain the function that provided the
> > > original source of user data - then it would be possible to write a script that
> > > shows the path (call chain) between the two functions - I hoped smatch_scripts/
> > > call_tree.pl could do this but it doesn't seem to do anything for me.
> > 
> > That's the smatch_data/db/smdb.py script.  I have it as an alias in vim
> > so I can do CTRL-c to see how a function is called or CTRL-r to see
> > what it returns
> > 
> > # map <C-r> :! vim_smdb return_states <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> > # map <C-c> :! vim_smdb <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> > 
> 
> I've spent some time playing with this - thanks.
> 
> Through this discussion I'm able to detect when annotated function parameters contain
> user provided values. The challenge for me is to detect where that data originated
> from (i.e. following the parameter up the call tree) to ease debugging.
> 
> My first attempt didn't trace the parameters and just looked at the call tree for any
> functions which provided user data, however this resulted in false positives (e.g.
> just because a function higher up in the call stack passed user data, it doesn't mean
> it was this data that made it to the target function).
> 
> I've made some progress by adapting the trace_param feature of smdb.py - I've modified
> it such that for a given symbol (e.g. find_vma), it will trace the parameters up the
> call tree, if it sees a function where the param is user data (8017) then it prints the
> symbol. This seems to work, but it doesn't seem to catch everything it should.
> 
> For example, find_vma is called by apply_vma_lock_flags which is called by do_mlock,
> probing the database:
> 
> sqlite> select * from caller_info where function='find_vma' and (parameter = -1 or parameter = 1) and caller='apply_vma_lock_flags';         
> file|caller|function|call_id|static|type|parameter|key|value
> mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|0|-1||struct vm_area_struct*(*)(struct mm_struct*, ulong)
> mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1004|1|$|1
> mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1014|1|$|p 0
> mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1001|1|$|0,4096-18446744073709547520
> 
> The third entry indicates that find_vma was called by apply_vma_lock_flags and find_vma(arg 1) == apply_vma_lock_flags(arg 0),
> this is correct, moving up the call tree:
> 
> sqlite> select * from caller_info where function='apply_vma_lock_flags' and (parameter = -1 or parameter = 0) and caller='do_mlock';
> file|caller|function|call_id|static|type|parameter|key|value
> mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|0|-1||int(*)(ulong, ulong, ulong)
> mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1004|0|$|1
> mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1001|0|$|0,4096-18446744073709547520
> 
> The database doesn't know that when do_mlock calls apply_vma_lock_flags the first argument of
> apply_vma_lock_flags is the first argument of do_mlock. There is no data source associated and
> our tracing of params stops early. Do you have any clue why this may be?

It seems that this one is due to the 'param_was_set' check in smatch_data_source.c,
removing this check overcomes this issue. Is this check necessary? Or perhaps this
should have returned true?

Also should "get_user" be added to smatch_kernel_user_data.c in addition to
copy_from_user?

Thanks,

Andrew Murray

> 
> I've rebuilt the database in a loop with ~/smatch/smatch_scripts/build_kernel_data.sh.
> 
> Thanks,
> 
> Andrew Murray
> 
> [1] https://llvm.org/devmtg/2018-10/slides/Serebryany-Stepanov-Tsyrklevich-Memory-Tagging-Slides-LLVM-2018.pdf
> 
> > regards,
> > dan carpenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-06-05 11:47         ` Andrew Murray
@ 2019-06-05 12:28           ` Andrew Murray
  2019-06-06 10:15             ` Dan Carpenter
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Murray @ 2019-06-05 12:28 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: smatch

On Wed, Jun 05, 2019 at 12:47:38PM +0100, Andrew Murray wrote:
> On Wed, Jun 05, 2019 at 09:29:26AM +0100, Andrew Murray wrote:
> > On Thu, May 30, 2019 at 08:46:42PM +0300, Dan Carpenter wrote:
> > > On Thu, May 30, 2019 at 10:03:30AM +0100, Andrew Murray wrote:
> > > > > Another problem is that Smatch doesn't understand how the sign_extend64()
> > > > > function works so it doesn't understand the untagged_addr() macro.  :/
> > > > > I see one bug here and a missing feature...  Right now Smatch thinks
> > > > > that the return value is totally unknown and not user controlled.  This
> > > > > is a fixable issue by implementing SPECIAL_LEFTSHIFT in rl_binop().
> > > > 
> > > > Actually the ability to treat the return type as not user controlled makes
> > > > this a little simpler...
> > > > 
> > > > We want to flag a potential issue when a tagged address is used in a function
> > > > that wants untagged addresses (as marked by the address space annotation).
> > > > Therefore we can simply test for "ctype.as == 5 && is_user_rl(expr)" to meet
> > > > this condition.
> > > > 
> > > > When a user correctly calls untagged_addr prior to calling these functions
> > > > the macro helpfully drops the user controlled flag which stops the condition
> > > > above triggering.
> > > > 
> > > > Is there a way of getting smatch to drop the 'user controlled' flag without
> > > > relying on a bug?
> > > 
> > > We know the high byte is either 0 or 0xff so we could do something like:
> > > 
> > > 	struct symbol *type;
> > > 	sval_t invalid = { .type = &ullong_ctype, .vaule = 1ULL << 62 };
> > > 
> > > 	type = get_type(expr);
> > > 	if (!type || type_bits(type) != 64)
> > > 		return;
> > > 
> > > 	if (!get_user_rl(expr, &rl))
> > > 		return;
> > > 	if (rl_has_sval(rl, invalid))
> > > 		sm_warning("potential tagged issue '%s'", name);
> > > 
> > > I will fix the untagged_addr() handling very soon.
> > 
> > Thanks for this.
> > 
> > Though actually for MTE [1] the top byte may contain any value. But the above
> > could easily be updated.
> > 
> > > 
> > > > Thanks to your help I believe I have this working as follows:
> > > > 
> > > > diff --git a/check_list.h b/check_list.h
> > > > index b1d24c504ba5..f7551e7c5215 100644
> > > > --- a/check_list.h
> > > > +++ b/check_list.h
> > > > @@ -192,6 +192,7 @@ CK(check_nospec_barrier)
> > > >  CK(check_spectre)
> > > >  CK(check_spectre_second_half)
> > > >  CK(check_implicit_dependencies)
> > > > +CK(check_tagged)
> > > >  
> > > >  /* wine specific stuff */
> > > >  CK(check_wine_filehandles)
> > > > diff --git a/check_tagged.c b/check_tagged.c
> > > > new file mode 100644
> > > > index 000000000000..d14a81a6c33a
> > > > --- /dev/null
> > > > +++ b/check_tagged.c
> > > > @@ -0,0 +1,49 @@
> > > > +#include "smatch.h"
> > > > +#include "smatch_extra.h"
> > > > +
> > > > +static void untagged_check(struct expression *expr)
> > > > +{
> > > > +       char *name;
> > > > +
> > > > +       if (parse_error)
> > > > +               return;
> > > > +
> > > > +       if (is_impossible_path())
> > > > +               return;
> > > > +
> > > > +       if (expr->type == EXPR_PREOP)
> > > > +               return;
> > > > +
> > > > +       if (is_user_rl(expr)) {
> > > > +               if (expr->symbol && expr->symbol->ctype.as == 5) {
> > > > +                       name = expr_to_str(expr);
> > > > +                       sm_warning("potential tagged issue '%s'", name);
> > > > +                       free_string(name);
> > > > +               }
> > > > +       }
> > > > +}
> > > > +
> > > > +void check_tagged(int id)
> > > > +{
> > > > +       if (option_project != PROJ_KERNEL)
> > > > +               return;
> > > > +
> > > > +       add_hook(&untagged_check, EXPR_HOOK);
> > > > +}
> > > >  
> > > > This correctly identifies functions that have been annotated with the __untagged
> > > > annotation that use data from userspace. However, it's not actually that
> > > > useful...
> > > > 
> > > > For example, if I annotate the find_vma function, then it will flag this as a
> > > > potential tagged issue given that its called (from at least some contexts) with
> > > > userspace data. However find_vma is called all over the kernel, and so it's
> > > > difficult to figure out which caller called find_vma with user data.
> > > > 
> > > > It's possible to use ./smatch_scripts/unlocked_paths.pl with a garbage lock and
> > > > 'find_vma' target to look for all the callers where this is a leaf function, but
> > > > this doesn't track the call parameters to exclude paths where the leaf annotated
> > > > function contains user data.
> > > 
> > > Huh...  I haven't thought about that script in years...  :/
> > > 
> > > What I do is I download the latest linux-next ever day and rebuild my DB
> > > every day.  Each time you rebuild it, the call tree gets filled out
> > > a bit.  After about a week then the DB is as complete as it is going to
> > > get.
> > > 
> > > Then I use the smatch_data/db/smdb.py script to figure out the warnings.
> > > I should add an option so that it only shows callers which pass user
> > > data.  Each call site has a unique caller ID.
> > 
> > Thanks - I hadn't looked at that script, but looks very useful.
> > 
> > By the way with the following hunk you can, for a given function, which call sites
> > pass user data.
> > 
> > @@ -614,6 +642,7 @@ elif sys.argv[1] == "call_info":
> >      print_caller_info(filename, func)
> >  elif sys.argv[1] == "user_data":
> >      func = sys.argv[2]
> > +    filename = sys.argv[3]
> >      print_caller_info(filename, func, "USER_DATA")
> >  elif sys.argv[1] == "param_value":
> >      func = sys.argv[2]
> > 
> > Though I've found it's quite useful to manually type commands in the sqlite3 CLI.
> > 
> > > 
> > > 
> > > > I haven't fully understood var_user_rl - but it looks like functions such as
> > > > match_user_assign_function propogate the user_data tag on each run to update
> > > > state, and thus it's probably not easy to get the original source of user
> > > > data after is_user_rl determines its user. Is that correct?
> > > 
> > > Use the get_user_rl() function instead.  The var_user_rl() is only
> > > supposed to be used internally, to do math.
> > 
> > OK.
> > 
> > > 
> > > > 
> > > > I would be great if there was a way to obtain the function that provided the
> > > > original source of user data - then it would be possible to write a script that
> > > > shows the path (call chain) between the two functions - I hoped smatch_scripts/
> > > > call_tree.pl could do this but it doesn't seem to do anything for me.
> > > 
> > > That's the smatch_data/db/smdb.py script.  I have it as an alias in vim
> > > so I can do CTRL-c to see how a function is called or CTRL-r to see
> > > what it returns
> > > 
> > > # map <C-r> :! vim_smdb return_states <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> > > # map <C-c> :! vim_smdb <cword> <CR> :execute 'edit' system("cat ~/.smdb_tmp/cur") <CR>
> > > 
> > 
> > I've spent some time playing with this - thanks.
> > 
> > Through this discussion I'm able to detect when annotated function parameters contain
> > user provided values. The challenge for me is to detect where that data originated
> > from (i.e. following the parameter up the call tree) to ease debugging.
> > 
> > My first attempt didn't trace the parameters and just looked at the call tree for any
> > functions which provided user data, however this resulted in false positives (e.g.
> > just because a function higher up in the call stack passed user data, it doesn't mean
> > it was this data that made it to the target function).
> > 
> > I've made some progress by adapting the trace_param feature of smdb.py - I've modified
> > it such that for a given symbol (e.g. find_vma), it will trace the parameters up the
> > call tree, if it sees a function where the param is user data (8017) then it prints the
> > symbol. This seems to work, but it doesn't seem to catch everything it should.
> > 
> > For example, find_vma is called by apply_vma_lock_flags which is called by do_mlock,
> > probing the database:
> > 
> > sqlite> select * from caller_info where function='find_vma' and (parameter = -1 or parameter = 1) and caller='apply_vma_lock_flags';         
> > file|caller|function|call_id|static|type|parameter|key|value
> > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|0|-1||struct vm_area_struct*(*)(struct mm_struct*, ulong)
> > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1004|1|$|1
> > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1014|1|$|p 0
> > mm/mlock.c|apply_vma_lock_flags|find_vma|598093|0|1001|1|$|0,4096-18446744073709547520
> > 
> > The third entry indicates that find_vma was called by apply_vma_lock_flags and find_vma(arg 1) == apply_vma_lock_flags(arg 0),
> > this is correct, moving up the call tree:
> > 
> > sqlite> select * from caller_info where function='apply_vma_lock_flags' and (parameter = -1 or parameter = 0) and caller='do_mlock';
> > file|caller|function|call_id|static|type|parameter|key|value
> > mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|0|-1||int(*)(ulong, ulong, ulong)
> > mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1004|0|$|1
> > mm/mlock.c|do_mlock|apply_vma_lock_flags|598106|1|1001|0|$|0,4096-18446744073709547520
> > 
> > The database doesn't know that when do_mlock calls apply_vma_lock_flags the first argument of
> > apply_vma_lock_flags is the first argument of do_mlock. There is no data source associated and
> > our tracing of params stops early. Do you have any clue why this may be?
> 
> It seems that this one is due to the 'param_was_set' check in smatch_data_source.c,
> removing this check overcomes this issue. Is this check necessary? Or perhaps this
> should have returned true?
> 
> Also should "get_user" be added to smatch_kernel_user_data.c in addition to
> copy_from_user?

Actually it looks there is logic associated to get_user in handle_get_user within
smatch_kernel_user_data.c - however this doesn't seem to trigger in the following (it looks
like it does something with the return value of the macro, but this is just an error flag):

do_pages_move -calls-> add_page_for_migration - the second argument of
add_page_for_migration 'addr' comes from get_user - but it isn't marked as USER_DATA in
the database. I don't understand why this is.

Thanks,

Andrew Murray

> 
> Thanks,
> 
> Andrew Murray
> 
> > 
> > I've rebuilt the database in a loop with ~/smatch/smatch_scripts/build_kernel_data.sh.
> > 
> > Thanks,
> > 
> > Andrew Murray
> > 
> > [1] https://llvm.org/devmtg/2018-10/slides/Serebryany-Stepanov-Tsyrklevich-Memory-Tagging-Slides-LLVM-2018.pdf
> > 
> > > regards,
> > > dan carpenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-06-05  8:29       ` Andrew Murray
  2019-06-05 11:47         ` Andrew Murray
@ 2019-06-06  9:40         ` Dan Carpenter
  2019-06-13  9:15           ` Andrew Murray
  2019-06-06 14:39         ` Dan Carpenter
  2 siblings, 1 reply; 13+ messages in thread
From: Dan Carpenter @ 2019-06-06  9:40 UTC (permalink / raw)
  To: Andrew Murray; +Cc: smatch

On Wed, Jun 05, 2019 at 09:29:26AM +0100, Andrew Murray wrote:
> On Thu, May 30, 2019 at 08:46:42PM +0300, Dan Carpenter wrote:

> > Then I use the smatch_data/db/smdb.py script to figure out the warnings.
> > I should add an option so that it only shows callers which pass user
> > data.  Each call site has a unique caller ID.
> 
> Thanks - I hadn't looked at that script, but looks very useful.
> 
> By the way with the following hunk you can, for a given function, which call sites
> pass user data.
> 
> @@ -614,6 +642,7 @@ elif sys.argv[1] == "call_info":
>      print_caller_info(filename, func)
>  elif sys.argv[1] == "user_data":
>      func = sys.argv[2]
> +    filename = sys.argv[3]
>      print_caller_info(filename, func, "USER_DATA")
>  elif sys.argv[1] == "param_value":
>      func = sys.argv[2]
> 

Could you send me a normal patch with a Signed-off-by and I will apply
it?  Otherwise I can handle it if you want.


> The database doesn't know that when do_mlock calls apply_vma_lock_flags the first argument of
> apply_vma_lock_flags is the first argument of do_mlock. There is no data source associated and
> our tracing of params stops early. Do you have any clue why this may be?

It's because the start variable gets modified in do_mlock():

mm/mlock.c
   671  static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
                                         ^^^^^^^^^^^^^^^^^^^

   672  {
   673          unsigned long locked;
   674          unsigned long lock_limit;
   675          int error = -ENOMEM;
   676  
   677          if (!can_do_mlock())
   678                  return -EPERM;
   679  
   680          len = PAGE_ALIGN(len + (offset_in_page(start)));
   681          start &= PAGE_MASK;
                ^^^^^^^^^^^^^^^^^^
Modified here.

I could change this behavior if you wanted.  I'd record the source and
add an "[m]" or something to say that it had been modified in that
function...  I'm tempted to change the "p 0" to "$0" but I would leave
the "r function" format the same for now.

regards,
dan carpenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-06-05 12:28           ` Andrew Murray
@ 2019-06-06 10:15             ` Dan Carpenter
  2019-06-13 10:41               ` Andrew Murray
  0 siblings, 1 reply; 13+ messages in thread
From: Dan Carpenter @ 2019-06-06 10:15 UTC (permalink / raw)
  To: Andrew Murray; +Cc: smatch

On Wed, Jun 05, 2019 at 01:28:41PM +0100, Andrew Murray wrote:
> Actually it looks there is logic associated to get_user in handle_get_user within
> smatch_kernel_user_data.c - however this doesn't seem to trigger in the following (it looks
> like it does something with the return value of the macro, but this is just an error flag):
> 
> do_pages_move -calls-> add_page_for_migration - the second argument of
> add_page_for_migration 'addr' comes from get_user - but it isn't marked as USER_DATA in
> the database. I don't understand why this is.
> 

That's weird.  It's working for me.

        mm/migrate.c |        do_pages_move | add_page_for_migration |   USER_DATA |  1 | addr | 0-u64max

Btw, I pushed the fix for untagged_addr() and my DB was built with that
fix.  You could try see if that makes a difference, I guess?

kchecker --info mm/migrate.c > info.txt
~/path/to/smatch/smatch_data/db/reload_partial.sh info.txt

regards,
dan carpenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-06-05  8:29       ` Andrew Murray
  2019-06-05 11:47         ` Andrew Murray
  2019-06-06  9:40         ` Dan Carpenter
@ 2019-06-06 14:39         ` Dan Carpenter
  2019-06-13 10:40           ` Andrew Murray
  2 siblings, 1 reply; 13+ messages in thread
From: Dan Carpenter @ 2019-06-06 14:39 UTC (permalink / raw)
  To: Andrew Murray; +Cc: smatch

On Wed, Jun 05, 2019 at 09:29:26AM +0100, Andrew Murray wrote:
> Through this discussion I'm able to detect when annotated function parameters contain
> user provided values. The challenge for me is to detect where that data originated
> from (i.e. following the parameter up the call tree) to ease debugging.
> 
> My first attempt didn't trace the parameters and just looked at the call tree for any
> functions which provided user data, however this resulted in false positives (e.g.
> just because a function higher up in the call stack passed user data, it doesn't mean
> it was this data that made it to the target function).

One of the main causes of this is function pointers that take a void
pointer argument.  For example, iblock_execute_sync_cache() takes a
user controlled "cmd" struct and says "bio->bi_private = cmd;"
Then in floppy_rb0_cb() we do:

	struct rb0_cbdata *cbdata = (struct rb0_cbdata *)bio->bi_private;

And smatch says that *cbdata is entirely user controlled...  The nice
fix for this would be if the mtag code were implimented and we could
tie function pointers to their data pointers very accurately.
Unfortunately, that's a pretty huge project and it's going to take a
while to complete.

A quicker fix is to add a line to smatch_data/db/fixup_kernel.sh

delete from caller_info where function = '(struct bio)->bi_end_io' and type = 8017;

which just deletes the user data from caller_info.

regards,
dan carpenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-06-06  9:40         ` Dan Carpenter
@ 2019-06-13  9:15           ` Andrew Murray
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Murray @ 2019-06-13  9:15 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: smatch

On Thu, Jun 06, 2019 at 12:40:04PM +0300, Dan Carpenter wrote:
> On Wed, Jun 05, 2019 at 09:29:26AM +0100, Andrew Murray wrote:
> > On Thu, May 30, 2019 at 08:46:42PM +0300, Dan Carpenter wrote:
> 
> > > Then I use the smatch_data/db/smdb.py script to figure out the warnings.
> > > I should add an option so that it only shows callers which pass user
> > > data.  Each call site has a unique caller ID.
> > 
> > Thanks - I hadn't looked at that script, but looks very useful.
> > 
> > By the way with the following hunk you can, for a given function, which call sites
> > pass user data.
> > 
> > @@ -614,6 +642,7 @@ elif sys.argv[1] == "call_info":
> >      print_caller_info(filename, func)
> >  elif sys.argv[1] == "user_data":
> >      func = sys.argv[2]
> > +    filename = sys.argv[3]
> >      print_caller_info(filename, func, "USER_DATA")
> >  elif sys.argv[1] == "param_value":
> >      func = sys.argv[2]
> > 
> 
> Could you send me a normal patch with a Signed-off-by and I will apply
> it?  Otherwise I can handle it if you want.

Yes I will do this, though waiting for legal approval on contributing here first.

> 
> 
> > The database doesn't know that when do_mlock calls apply_vma_lock_flags the first argument of
> > apply_vma_lock_flags is the first argument of do_mlock. There is no data source associated and
> > our tracing of params stops early. Do you have any clue why this may be?
> 
> It's because the start variable gets modified in do_mlock():
> 
> mm/mlock.c
>    671  static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
>                                          ^^^^^^^^^^^^^^^^^^^
> 
>    672  {
>    673          unsigned long locked;
>    674          unsigned long lock_limit;
>    675          int error = -ENOMEM;
>    676  
>    677          if (!can_do_mlock())
>    678                  return -EPERM;
>    679  
>    680          len = PAGE_ALIGN(len + (offset_in_page(start)));
>    681          start &= PAGE_MASK;
>                 ^^^^^^^^^^^^^^^^^^
> Modified here.
> 
> I could change this behavior if you wanted.  I'd record the source and
> add an "[m]" or something to say that it had been modified in that
> function...  I'm tempted to change the "p 0" to "$0" but I would leave
> the "r function" format the same for now.

I think it would be helpful to record the source with an [m] modifier - I'm
sure there may be other users for this in the future. I'd be grateful if
you did that.

What is the difference between "p 0" and "$0" as the value? And the r function?

Also how would this work where a value is derived from multiple sources, e.g:

func2(int start, end)
{
}

func1(int addr, int len)
{
	func2(addr, addr + len)
}

The 'end' parameter in func2 is derived from both arguments 'addr' and 'len'
of func1. Perhaps it would be expressed as an additional line in the datbase?

|func1|func2|||8017|0|$|p0
|func1|func2|||8017|1|$|p0
|func1|func2|||8017|1|$|p1

I guess this changes the path into a tree.

Thanks,

Andrew Murray

> 
> regards,
> dan carpenter
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-06-06 14:39         ` Dan Carpenter
@ 2019-06-13 10:40           ` Andrew Murray
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Murray @ 2019-06-13 10:40 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: smatch

On Thu, Jun 06, 2019 at 05:39:31PM +0300, Dan Carpenter wrote:
> On Wed, Jun 05, 2019 at 09:29:26AM +0100, Andrew Murray wrote:
> > Through this discussion I'm able to detect when annotated function parameters contain
> > user provided values. The challenge for me is to detect where that data originated
> > from (i.e. following the parameter up the call tree) to ease debugging.
> > 
> > My first attempt didn't trace the parameters and just looked at the call tree for any
> > functions which provided user data, however this resulted in false positives (e.g.
> > just because a function higher up in the call stack passed user data, it doesn't mean
> > it was this data that made it to the target function).
> 
> One of the main causes of this is function pointers that take a void
> pointer argument.  For example, iblock_execute_sync_cache() takes a
> user controlled "cmd" struct and says "bio->bi_private = cmd;"
> Then in floppy_rb0_cb() we do:
> 
> 	struct rb0_cbdata *cbdata = (struct rb0_cbdata *)bio->bi_private;
> 
> And smatch says that *cbdata is entirely user controlled...  The nice
> fix for this would be if the mtag code were implimented and we could
> tie function pointers to their data pointers very accurately.
> Unfortunately, that's a pretty huge project and it's going to take a
> while to complete.
> 
> A quicker fix is to add a line to smatch_data/db/fixup_kernel.sh
> 
> delete from caller_info where function = '(struct bio)->bi_end_io' and type = 8017;
> 
> which just deletes the user data from caller_info.

Ah yes I understand this. Though really what I was referring to is:

funca(int __untagged x, int dontcare){
}

funcb(int __user y){
	funca(y, 0);
}

funcc(int z, int __user g){
	funca(z, g);
}

funcd(){
	funca(22, 0);
}

Without relying on parameter tracing, a calltree that has been filtered of
calls without USER_DATA will look this:

funca
 - funcb
 - funcc

We don't want the false positive of funcc being in this tree - it's here
because funcc calls funca with user data - its a parameter we don't care about.

However so long as we trace parameters (even if they are modified) through
the call tree, then it's possible to update the scripts to show a call stack
that is filtered for userdata for only parameters of interest.

Thanks,

Andrew Murray 

> 
> regards,
> dan carpenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Detecting user data on base types
  2019-06-06 10:15             ` Dan Carpenter
@ 2019-06-13 10:41               ` Andrew Murray
  0 siblings, 0 replies; 13+ messages in thread
From: Andrew Murray @ 2019-06-13 10:41 UTC (permalink / raw)
  To: Dan Carpenter; +Cc: smatch

On Thu, Jun 06, 2019 at 01:15:09PM +0300, Dan Carpenter wrote:
> On Wed, Jun 05, 2019 at 01:28:41PM +0100, Andrew Murray wrote:
> > Actually it looks there is logic associated to get_user in handle_get_user within
> > smatch_kernel_user_data.c - however this doesn't seem to trigger in the following (it looks
> > like it does something with the return value of the macro, but this is just an error flag):
> > 
> > do_pages_move -calls-> add_page_for_migration - the second argument of
> > add_page_for_migration 'addr' comes from get_user - but it isn't marked as USER_DATA in
> > the database. I don't understand why this is.
> > 
> 
> That's weird.  It's working for me.
> 
>         mm/migrate.c |        do_pages_move | add_page_for_migration |   USER_DATA |  1 | addr | 0-u64max
> 
> Btw, I pushed the fix for untagged_addr() and my DB was built with that
> fix.  You could try see if that makes a difference, I guess?
> 
> kchecker --info mm/migrate.c > info.txt
> ~/path/to/smatch/smatch_data/db/reload_partial.sh info.txt

I still don't see it. I'll investigate further.

(I'm running the tools on x86_64, cross building for ARM64 - I'll share some patches in
due course for cross-compilation).

Thanks,

Andrew Murray

> 
> regards,
> dan carpenter

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-06-13 10:41 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-29 18:47 Detecting user data on base types Andrew Murray
2019-05-29 19:49 ` Dan Carpenter
2019-05-30  9:03   ` Andrew Murray
2019-05-30 17:46     ` Dan Carpenter
2019-06-05  8:29       ` Andrew Murray
2019-06-05 11:47         ` Andrew Murray
2019-06-05 12:28           ` Andrew Murray
2019-06-06 10:15             ` Dan Carpenter
2019-06-13 10:41               ` Andrew Murray
2019-06-06  9:40         ` Dan Carpenter
2019-06-13  9:15           ` Andrew Murray
2019-06-06 14:39         ` Dan Carpenter
2019-06-13 10:40           ` Andrew Murray

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.