bpf: introduce BPF token object

Add new kind of BPF kernel object, BPF token. BPF token is meant to
allow delegating privileged BPF functionality, like loading a BPF
program or creating a BPF map, from privileged process to a *trusted*
unprivileged process, all while having a good amount of control over which
privileged operations could be performed using provided BPF token.

This is achieved through mounting BPF FS instance with extra delegation
mount options, which determine what operations are delegatable, and also
constraining it to the owning user namespace (as mentioned in the
previous patch).

BPF token itself is just a derivative from BPF FS and can be created
through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
FS FD, which can be attained through open() API by opening BPF FS mount
point. Currently, BPF token "inherits" delegated command, map types,
prog type, and attach type bit sets from BPF FS as is. In the future,
having a BPF token as a separate object with its own FD, we can allow
further restricting BPF token's allowable set of things either at the
creation time or after the fact, allowing the process to guard itself
further from unintentionally trying to load undesired kind of BPF
programs. But for now we keep things simple and just copy bit sets as is.

When BPF token is created from BPF FS mount, we take reference to the
BPF super block's owning user namespace, and then use that namespace for
checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
capabilities that are normally only checked against init userns (using
capable()), but now we check them using ns_capable() instead (if BPF
token is provided). See bpf_token_capable() for details.

Such setup means that BPF token in itself is not sufficient to grant BPF
functionality. User namespaced process has to *also* have necessary
combination of capabilities inside that user namespace. So while
previously CAP_BPF was useless when granted within user namespace, now
it gains a meaning and allows container managers and sys admins to have
a flexible control over which processes can and need to use BPF
functionality within the user namespace (i.e., container in practice).
And BPF FS delegation mount options and derived BPF tokens serve as
a per-container "flag" to grant overall ability to use bpf() (plus further
restrict which parts of bpf() syscalls are treated as namespaced).

Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
within the BPF FS owning user namespace, rounding out the ns_capable()
story of BPF token.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-01 02:52:15 +08:00
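The capability rule described in the commit message can be modeled in plain userspace C. This is only an illustrative sketch, not kernel code: `struct model_caps`, `model_token_capable()`, and `model_examples()` are hypothetical names, the boolean fields stand in for what `ns_capable()`, `capable()`, and the user namespace comparison would report inside the kernel, and the LSM hook is assumed to allow the request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs standing in for kernel state (assumption, not a kernel type). */
struct model_caps {
	bool has_token;      /* a BPF token was supplied */
	bool same_userns;    /* caller's userns is exactly the token's userns */
	bool ns_cap;         /* ns_capable(token userns, cap) would succeed */
	bool ns_sys_admin;   /* ns_capable(token userns, CAP_SYS_ADMIN) would succeed */
	bool init_cap;       /* capable(cap) would succeed */
	bool init_sys_admin; /* capable(CAP_SYS_ADMIN) would succeed */
};

/*
 * Models the decision: with a token from the caller's own userns, a
 * namespaced capability (or namespaced CAP_SYS_ADMIN as a superset) is
 * enough; otherwise fall back to the init-userns check.
 */
static inline bool model_token_capable(const struct model_caps *c,
				       bool cap_is_sys_admin)
{
	if (c->has_token && c->same_userns) {
		if (c->ns_cap || (!cap_is_sys_admin && c->ns_sys_admin))
			return true;
	}
	return c->init_cap || (!cap_is_sys_admin && c->init_sys_admin);
}

/* A few worked cases, checked with assert(). */
static inline void model_examples(void)
{
	/* Token from the same userns plus namespaced CAP_BPF: allowed. */
	struct model_caps in_container = {
		.has_token = true, .same_userns = true, .ns_cap = true,
	};
	/* Token from a different userns: namespaced caps do not count. */
	struct model_caps foreign_token = {
		.has_token = true, .same_userns = false, .ns_cap = true,
	};
	/* No token, but CAP_SYS_ADMIN in the init userns: allowed. */
	struct model_caps host_admin = { .init_sys_admin = true };

	assert(model_token_capable(&in_container, false));
	assert(!model_token_capable(&foreign_token, false));
	assert(model_token_capable(&host_admin, false));
}
```

The model makes the point from the message concrete: the token alone grants nothing, and a namespaced capability alone grants nothing; both have to be present in the same user namespace.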
#include <linux/bpf.h>
#include <linux/vmalloc.h>
#include <linux/fdtable.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/idr.h>
#include <linux/namei.h>
#include <linux/user_namespace.h>
#include <linux/security.h>

bool bpf_token_capable(const struct bpf_token *token, int cap)
{
	/* BPF token allows ns_capable() level of capabilities, but only if
	 * token's userns is *exactly* the same as current user's userns
	 */
	if (token && current_user_ns() == token->userns) {
		if (ns_capable(token->userns, cap) ||
		    (cap != CAP_SYS_ADMIN && ns_capable(token->userns, CAP_SYS_ADMIN)))
			return security_bpf_token_capable(token, cap) == 0;
	}
	/* otherwise fallback to capable() checks */
	return capable(cap) || (cap != CAP_SYS_ADMIN && capable(CAP_SYS_ADMIN));
}

void bpf_token_inc(struct bpf_token *token)
{
	atomic64_inc(&token->refcnt);
}

static void bpf_token_free(struct bpf_token *token)
{
	security_bpf_token_free(token);
	put_user_ns(token->userns);
	kvfree(token);
}

static void bpf_token_put_deferred(struct work_struct *work)
{
	struct bpf_token *token = container_of(work, struct bpf_token, work);

	bpf_token_free(token);
}

void bpf_token_put(struct bpf_token *token)
{
	if (!token)
		return;

	if (!atomic64_dec_and_test(&token->refcnt))
		return;

	INIT_WORK(&token->work, bpf_token_put_deferred);
	schedule_work(&token->work);
}

static int bpf_token_release(struct inode *inode, struct file *filp)
{
	struct bpf_token *token = filp->private_data;

	bpf_token_put(token);
	return 0;
}

static void bpf_token_show_fdinfo(struct seq_file *m, struct file *filp)
{
	struct bpf_token *token = filp->private_data;
	u64 mask;

	BUILD_BUG_ON(__MAX_BPF_CMD >= 64);
	mask = (1ULL << __MAX_BPF_CMD) - 1;
	if ((token->allowed_cmds & mask) == mask)
		seq_printf(m, "allowed_cmds:\tany\n");
	else
		seq_printf(m, "allowed_cmds:\t0x%llx\n", token->allowed_cmds);

	BUILD_BUG_ON(__MAX_BPF_MAP_TYPE >= 64);
	mask = (1ULL << __MAX_BPF_MAP_TYPE) - 1;
	if ((token->allowed_maps & mask) == mask)
		seq_printf(m, "allowed_maps:\tany\n");
	else
		seq_printf(m, "allowed_maps:\t0x%llx\n", token->allowed_maps);

	BUILD_BUG_ON(__MAX_BPF_PROG_TYPE >= 64);
	mask = (1ULL << __MAX_BPF_PROG_TYPE) - 1;
	if ((token->allowed_progs & mask) == mask)
		seq_printf(m, "allowed_progs:\tany\n");
	else
		seq_printf(m, "allowed_progs:\t0x%llx\n", token->allowed_progs);

	BUILD_BUG_ON(__MAX_BPF_ATTACH_TYPE >= 64);
	mask = (1ULL << __MAX_BPF_ATTACH_TYPE) - 1;
	if ((token->allowed_attachs & mask) == mask)
		seq_printf(m, "allowed_attachs:\tany\n");
	else
		seq_printf(m, "allowed_attachs:\t0x%llx\n", token->allowed_attachs);
}

#define BPF_TOKEN_INODE_NAME "bpf-token"

static const struct inode_operations bpf_token_iops = { };

static const struct file_operations bpf_token_fops = {
	.release	= bpf_token_release,
	.show_fdinfo	= bpf_token_show_fdinfo,
};

int bpf_token_create(union bpf_attr *attr)
{
	struct bpf_mount_opts *mnt_opts;
	struct bpf_token *token = NULL;
	struct user_namespace *userns;
	struct inode *inode;
	struct file *file;
	struct path path;
	struct fd f;
	umode_t mode;
	int err, fd;

	f = fdget(attr->token_create.bpffs_fd);
	if (!f.file)
		return -EBADF;

	path = f.file->f_path;
	path_get(&path);
	fdput(f);

	if (path.dentry != path.mnt->mnt_sb->s_root) {
		err = -EINVAL;
		goto out_path;
	}
	if (path.mnt->mnt_sb->s_op != &bpf_super_ops) {
		err = -EINVAL;
		goto out_path;
	}
	err = path_permission(&path, MAY_ACCESS);
	if (err)
		goto out_path;

	userns = path.dentry->d_sb->s_user_ns;
	/*
	 * Enforce that creators of BPF tokens are in the same user
	 * namespace as the BPF FS instance. This makes reasoning about
	 * permissions a lot easier and we can always relax this later.
	 */
	if (current_user_ns() != userns) {
		err = -EPERM;
		goto out_path;
	}
	if (!ns_capable(userns, CAP_BPF)) {
		err = -EPERM;
		goto out_path;
	}

	mode = S_IFREG | ((S_IRUSR | S_IWUSR) & ~current_umask());
	inode = bpf_get_inode(path.mnt->mnt_sb, NULL, mode);
	if (IS_ERR(inode)) {
		err = PTR_ERR(inode);
		goto out_path;
	}

	inode->i_op = &bpf_token_iops;
	inode->i_fop = &bpf_token_fops;
	clear_nlink(inode); /* make sure it is unlinked */

	file = alloc_file_pseudo(inode, path.mnt, BPF_TOKEN_INODE_NAME, O_RDWR, &bpf_token_fops);
	if (IS_ERR(file)) {
		iput(inode);
		err = PTR_ERR(file);
		goto out_path;
	}

	token = kvzalloc(sizeof(*token), GFP_USER);
	if (!token) {
		err = -ENOMEM;
		goto out_file;
	}

	atomic64_set(&token->refcnt, 1);

	/* remember bpffs owning userns for future ns_capable() checks */
	token->userns = get_user_ns(userns);

	mnt_opts = path.dentry->d_sb->s_fs_info;
	token->allowed_cmds = mnt_opts->delegate_cmds;
	token->allowed_maps = mnt_opts->delegate_maps;
	token->allowed_progs = mnt_opts->delegate_progs;
	token->allowed_attachs = mnt_opts->delegate_attachs;

	err = security_bpf_token_create(token, attr, &path);
	if (err)
		goto out_token;

	fd = get_unused_fd_flags(O_CLOEXEC);
	if (fd < 0) {
		err = fd;
		goto out_token;
	}

	file->private_data = token;
	fd_install(fd, file);

	path_put(&path);
	return fd;

out_token:
	bpf_token_free(token);
out_file:
	fput(file);
out_path:
	path_put(&path);
	return err;
}

struct bpf_token *bpf_token_get_from_fd(u32 ufd)
{
	struct fd f = fdget(ufd);
	struct bpf_token *token;

	if (!f.file)
		return ERR_PTR(-EBADF);
	if (f.file->f_op != &bpf_token_fops) {
		fdput(f);
		return ERR_PTR(-EINVAL);
	}

	token = f.file->private_data;
	bpf_token_inc(token);
	fdput(f);

	return token;
}

bool bpf_token_allow_cmd(const struct bpf_token *token, enum bpf_cmd cmd)
{
	/* BPF token can be used only within exactly the same userns in which
	 * it was created
	 */
	if (!token || current_user_ns() != token->userns)
		return false;
	if (!(token->allowed_cmds & (1ULL << cmd)))
		return false;
	return security_bpf_token_cmd(token, cmd) == 0;
}

bool bpf_token_allow_map_type(const struct bpf_token *token, enum bpf_map_type type)
{
	if (!token || type >= __MAX_BPF_MAP_TYPE)
		return false;

	return token->allowed_maps & (1ULL << type);
}

bool bpf_token_allow_prog_type(const struct bpf_token *token,
			       enum bpf_prog_type prog_type,
			       enum bpf_attach_type attach_type)
{
	if (!token || prog_type >= __MAX_BPF_PROG_TYPE || attach_type >= __MAX_BPF_ATTACH_TYPE)
		return false;

	return (token->allowed_progs & (1ULL << prog_type)) &&
	       (token->allowed_attachs & (1ULL << attach_type));
}
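The bit-set checks used by the `bpf_token_allow_*()` helpers and the "any" formatting in `bpf_token_show_fdinfo()` can be tried out in userspace with a small model. This is a sketch under stated assumptions rather than kernel code: `MODEL_MAX_TYPE`, `model_allows()`, and `model_format_mask()` are hypothetical names, and the limit of 8 is an arbitrary stand-in for constants such as `__MAX_BPF_MAP_TYPE`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical limit standing in for e.g. __MAX_BPF_MAP_TYPE; must stay below 64. */
#define MODEL_MAX_TYPE 8

/* True when bit 'type' is set in 'allowed' and 'type' is a known value. */
static inline bool model_allows(uint64_t allowed, unsigned int type)
{
	if (type >= MODEL_MAX_TYPE)
		return false;
	return allowed & (1ULL << type);
}

/*
 * Formats a mask the way the fdinfo handler does: "any" when every known
 * bit is set, otherwise the raw value in hex.
 */
static inline int model_format_mask(char *buf, size_t len, uint64_t allowed)
{
	uint64_t mask = (1ULL << MODEL_MAX_TYPE) - 1;

	if ((allowed & mask) == mask)
		return snprintf(buf, len, "any");
	return snprintf(buf, len, "0x%llx", (unsigned long long)allowed);
}
```

With `allowed = 0x5`, types 0 and 2 are permitted and type 1 is not; a mask with all eight low bits set is reported as "any". The range check before the shift matters because shifting a 64-bit value by 64 or more is undefined behavior in C.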