user_events: Add minimal support for trace_event into ftrace

Minimal support for interacting with dynamic events, trace_event and
ftrace. Core outline of flow between user process, ioctl and trace_event
APIs.

User mode processes that wish to use trace events to get data into
ftrace, perf, eBPF, etc are limited to uprobes today. The user events
features enables an ABI for user mode processes to create and write to
trace events that are isolated from kernel level trace events. This
enables a faster path for tracing from user mode data as well as opens
managed code to participate in trace events, where stub locations are
dynamic.

User processes often want to trace only when it's useful. To enable this
a set of pages are mapped into the user process space that indicate the
current state of the user events that have been registered. User
processes can check if their event is hooked to a trace/probe, and if it
is, emit the event data out via the write() syscall.

Two new files are introduced into tracefs to accomplish this:
user_events_status - This file is mmap'd into participating user mode
processes to indicate event status.

user_events_data - This file is opened and register/delete ioctl's are
issued to create/open/delete trace events that can be used for tracing.

The typical scenario is on process start to mmap user_events_status. Processes
then register the events they plan to use via the REG ioctl. The ioctl reads
and updates the passed in user_reg struct. The status_index of the struct is
used to know the byte in the status page to check for that event. The
write_index of the struct is used to describe that event when writing out to
the fd that was used for the ioctl call. The data must always include this
index first when writing out data for an event. Data can be written either by
write() or by writev().

For example, in memory:
int index;
char data[];

Psuedo code example of typical usage:
struct user_reg reg;

int page_fd = open("user_events_status", O_RDWR);
char *page_data = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, page_fd, 0);
close(page_fd);

int data_fd = open("user_events_data", O_RDWR);

reg.size = sizeof(reg);
reg.name_args = (__u64)"test";

ioctl(data_fd, DIAG_IOCSREG, &reg);
int status_id = reg.status_index;
int write_id = reg.write_index;

struct iovec io[2];
io[0].iov_base = &write_id;
io[0].iov_len = sizeof(write_id);
io[1].iov_base = payload;
io[1].iov_len = sizeof(payload);

if (page_data[status_id])
	writev(data_fd, io, 2);

User events are also exposed via the dynamic_events tracefs file for
both create and delete. Current status is exposed via the user_events_status
tracefs file.

Simple example to register a user event via dynamic_events:
	echo u:test >> dynamic_events
	cat dynamic_events
	u:test

If an event is hooked to a probe, the probe hooked shows up:
	echo 1 > events/user_events/test/enable
	cat user_events_status
	1:test # Used by ftrace

	Active: 1
	Busy: 1
	Max: 4096

If an event is not hooked to a probe, no probe status shows up:
	echo 0 > events/user_events/test/enable
	cat user_events_status
	1:test

	Active: 1
	Busy: 0
	Max: 4096

Users can describe the trace event format via the following format:
	name[:FLAG1[,FLAG2...] [field1[;field2...]]

Each field has the following format:
	type name

Example for char array with a size of 20 named msg:
	echo 'u:detailed char[20] msg' >> dynamic_events
	cat dynamic_events
	u:detailed char[20] msg

Data offsets are based on the data written out via write() and will be
updated to reflect the correct offset in the trace_event fields. For dynamic
data it is recommended to use the new __rel_loc data type. This type will be
the same as __data_loc, but the offset is relative to this entry. This allows
user_events to not worry about what common fields are being inserted before
the data.

The above format is valid for both the ioctl and the dynamic_events file.

Link: https://lkml.kernel.org/r/20220118204326.2169-2-beaub@linux.microsoft.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This commit is contained in:
Beau Belgrave 2022-01-18 12:43:15 -08:00 committed by Steven Rostedt (Google)
parent 55bc8384d3
commit 7f5a08c79d
4 changed files with 1318 additions and 0 deletions

View File

@ -0,0 +1,116 @@
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
/*
* Copyright (c) 2021, Microsoft Corporation.
*
* Authors:
* Beau Belgrave <beaub@linux.microsoft.com>
*/
#ifndef _UAPI_LINUX_USER_EVENTS_H
#define _UAPI_LINUX_USER_EVENTS_H
#include <linux/types.h>
#include <linux/ioctl.h>
#ifdef __KERNEL__
#include <linux/uio.h>
#else
#include <sys/uio.h>
#endif
#define USER_EVENTS_SYSTEM "user_events"
#define USER_EVENTS_PREFIX "u:"
/* Bits 0-6 are for known probe types, Bit 7 is for unknown probes */
#define EVENT_BIT_FTRACE 0
#define EVENT_BIT_PERF 1
#define EVENT_BIT_OTHER 7
#define EVENT_STATUS_FTRACE (1 << EVENT_BIT_FTRACE)
#define EVENT_STATUS_PERF (1 << EVENT_BIT_PERF)
#define EVENT_STATUS_OTHER (1 << EVENT_BIT_OTHER)
/* Create dynamic location entry within a 32-bit value */
#define DYN_LOC(offset, size) ((size) << 16 | (offset))
/* Use raw iterator for attached BPF program(s), no affect on ftrace/perf */
#define FLAG_BPF_ITER (1 << 0)
/*
* Describes an event registration and stores the results of the registration.
* This structure is passed to the DIAG_IOCSREG ioctl, callers at a minimum
* must set the size and name_args before invocation.
*/
struct user_reg {
/* Input: Size of the user_reg structure being used */
__u32 size;
/* Input: Pointer to string with event name, description and flags */
__u64 name_args;
/* Output: Byte index of the event within the status page */
__u32 status_index;
/* Output: Index of the event to use when writing data */
__u32 write_index;
};
#define DIAG_IOC_MAGIC '*'
/* Requests to register a user_event */
#define DIAG_IOCSREG _IOWR(DIAG_IOC_MAGIC, 0, struct user_reg*)
/* Requests to delete a user_event */
#define DIAG_IOCSDEL _IOW(DIAG_IOC_MAGIC, 1, char*)
/* Data type that was passed to the BPF program */
enum {
/* Data resides in kernel space */
USER_BPF_DATA_KERNEL,
/* Data resides in user space */
USER_BPF_DATA_USER,
/* Data is a pointer to a user_bpf_iter structure */
USER_BPF_DATA_ITER,
};
/*
* Describes an iovec iterator that BPF programs can use to access data for
* a given user_event write() / writev() call.
*/
struct user_bpf_iter {
/* Offset of the data within the first iovec */
__u32 iov_offset;
/* Number of iovec structures */
__u32 nr_segs;
/* Pointer to iovec structures */
const struct iovec *iov;
};
/* Context that BPF programs receive when attached to a user_event */
struct user_bpf_context {
/* Data type being passed (see union below) */
__u32 data_type;
/* Length of the data */
__u32 data_len;
/* Pointer to data, varies by data type */
union {
/* Kernel data (data_type == USER_BPF_DATA_KERNEL) */
void *kdata;
/* User data (data_type == USER_BPF_DATA_USER) */
void *udata;
/* Direct iovec (data_type == USER_BPF_DATA_ITER) */
struct user_bpf_iter *iter;
};
};
#endif /* _UAPI_LINUX_USER_EVENTS_H */

View File

@ -737,6 +737,20 @@ config SYNTH_EVENTS
If in doubt, say N.
config USER_EVENTS
bool "User trace events"
select TRACING
select DYNAMIC_EVENTS
help
User trace events are user-defined trace events that
can be used like an existing kernel trace event. User trace
events are generated by writing to a tracefs file. User
processes can determine if their tracing events should be
generated by memory mapping a tracefs file and checking for
an associated byte being non-zero.
If in doubt, say N.
config HIST_TRIGGERS
bool "Histogram triggers"
depends on ARCH_HAVE_NMI_SAFE_CMPXCHG

View File

@ -82,6 +82,7 @@ obj-$(CONFIG_PROBE_EVENTS) += trace_eprobe.o
obj-$(CONFIG_TRACE_EVENT_INJECT) += trace_events_inject.o
obj-$(CONFIG_SYNTH_EVENTS) += trace_events_synth.o
obj-$(CONFIG_HIST_TRIGGERS) += trace_events_hist.o
obj-$(CONFIG_USER_EVENTS) += trace_events_user.o
obj-$(CONFIG_BPF_EVENTS) += bpf_trace.o
obj-$(CONFIG_KPROBE_EVENTS) += trace_kprobe.o
obj-$(CONFIG_TRACEPOINTS) += error_report-traces.o

File diff suppressed because it is too large Load Diff