This patch:
- Adds a utility function for parsing a 64 bit address
- Adds a utility function for converting a 64 bit address to ASCII
- Adds and ILA encap type in lwt tunnels
Signed-off-by: Tom Herbert <tom@herbertland.com>
The recently introduced object pinning can be further extended in order
to allow sharing maps beyond tc namespace. F.e. maps that are being pinned
from tracing side, can be accessed through this facility as well.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
This larger work addresses one of the bigger remaining issues on
tc's eBPF frontend, that is, to allow for persistent file descriptors.
Whenever tc parses the ELF object, extracts and loads maps into the
kernel, these file descriptors will be out of reach after the tc
instance exits.
Meaning, for simple (unnested) programs which contain one or
multiple maps, the kernel holds a reference, and they will live
on inside the kernel until the program holding them is unloaded,
but they will be out of reach for user space, even worse with
(also multiple nested) tail calls.
For this issue, we introduced the concept of an agent that can
receive the set of file descriptors from the tc instance creating
them, in order to be able to further inspect/update map data for
a specific use case. However, while that is more tied towards
specific applications, it still doesn't easily allow for sharing
maps accross multiple tc instances and would require a daemon to
be running in the background. F.e. when a map should be shared by
two eBPF programs, one attached to ingress, one to egress, this
currently doesn't work with the tc frontend.
This work solves exactly that, i.e. if requested, maps can now be
_arbitrarily_ shared between object files (PIN_GLOBAL_NS) or within
a single object (but various program sections, PIN_OBJECT_NS) without
"loosing" the file descriptor set. To make that happen, we use eBPF
object pinning introduced in kernel commit b2197755b263 ("bpf: add
support for persistent maps/progs") for exactly this purpose.
The shipped examples/bpf/bpf_shared.c code from this patch can be
easily applied, for instance, as:
- classifier-classifier shared:
tc filter add dev foo parent 1: bpf obj shared.o sec egress
tc filter add dev foo parent ffff: bpf obj shared.o sec ingress
- classifier-action shared (here: late binding to a dummy classifier):
tc actions add action bpf obj shared.o sec egress pass index 42
tc filter add dev foo parent ffff: bpf obj shared.o sec ingress
tc filter add dev foo parent 1: bpf bytecode '1,6 0 0 4294967295,' \
action bpf index 42
The toy example increments a shared counter on egress and dumps its
value on ingress (if no sharing (PIN_NONE) would have been chosen,
map value is 0, of course, due to the two map instances being created):
[...]
<idle>-0 [002] ..s. 38264.788234: : map val: 4
<idle>-0 [002] ..s. 38264.788919: : map val: 4
<idle>-0 [002] ..s. 38264.789599: : map val: 5
[...]
... thus if both sections reference the pinned map(s) in question,
tc will take care of fetching the appropriate file descriptor.
The patch has been tested extensively on both, classifier and
action sides.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch adds support to parse and print lwtunnel
encapsulation attributes attached to routes for MPLS
and IP tunnels.
example:
Add ipv4 route with mpls encap attributes:
Examples:
MPLS:
$ ip route add 40.1.2.0/30 encap mpls 200 via inet 40.1.1.1 dev eth3
$ ip route show
40.1.2.0/30 encap mpls 200 via 40.1.1.1 dev eth3
Add ipv4 multipath route with mpls encap attributes:
$ ip route add 10.1.1.0/30 nexthop encap mpls 200 via 10.1.1.1 dev eth0 \
nexthop encap mpls 700 via 40.1.1.2 dev eth3
$ ip route show
10.1.1.0/30
nexthop encap mpls 200 via 10.1.1.1 dev eth0 weight 1
nexthop encap mpls 700 via 40.1.1.2 dev eth3 weight 1
IP:
$ ip route add 10.1.1.1/24 encap ip id 200 dst 20.1.1.1 dev vxlan0
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jiri Benc <jbenc@redhat.com>
When having optional classid, most minimal command can be sth
like:
tc filter add dev foo parent X: bpf obj prog.o
Therefore, adapt the code so that a next argument will not be
enforced as the case currently.
Also, minor cleanup on the classid, where we should rather
have used addattr32(), and add flags for exec configuration,
for example (using short notation):
tc filter add dev foo parent X: bpf da obj prog.o
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
This adds support for slightly less output than is normally provided by
'ip link show' and 'ip addr show'. This is a bit better when you have a
host with lots of interfaces. Sample output:
$ ip -br link show
lo UNKNOWN 00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
p2p1 UP 08:00:27:ee:0b:3b <BROADCAST,MULTICAST,UP,LOWER_UP>
p7p1 UP 08:00:27:9d:62:9f <BROADCAST,MULTICAST,UP,LOWER_UP>
p8p1 DOWN 08:00:27:dc:d8:ca <NO-CARRIER,BROADCAST,MULTICAST,UP>
p9p1 UP 08:00:27:76:d9:75 <BROADCAST,MULTICAST,UP,LOWER_UP>
p7p1.100@p7p1 UP 08:00:27:9d:62:9f <BROADCAST,MULTICAST,UP,LOWER_UP>
$ ip -br -4 addr show
lo UNKNOWN 127.0.0.1/8
p2p1 UP 192.168.56.2/24
p7p1 UP 70.0.0.1/24
p8p1 DOWN 80.0.0.1/24
p9p1 UP 10.0.5.15/24
p7p1.100@p7p1 UP 200.0.0.1/24
$ ip -br -6 addr show
lo UNKNOWN ::1/128
p2p1 UP fe80::a00:27ff:feee:b3b/64
p7p1 UP 7000::1/8 fe80::a00:27ff:fe9d:629f/64
p8p1 DOWN 8000::1/8
p9p1 UP fe80::a00:27ff:fe76:d975/64
p7p1.100@p7p1 UP fe80::a00:27ff:fe9d:629f/64
$ ip -br addr show p7p1
p7p1 UP 70.0.0.1/24 7000::1/8 fe80::a00:27ff:fe9d:629f/64
v2: Now with color support!
v3: Better field width estimation (except netdev names to keep output at a
decent width) and whitespace fixup.
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
This work finalizes both eBPF front-ends for the classifier and action
part in tc, it allows for custom ELF section selection, a simplified tc
command frontend (while keeping compat), reusing of common maps between
classifier and actions residing in the same object file, and exporting
of all map fds to an eBPF agent for handing off further control in user
space.
It also adds an extensive example of how eBPF can be used, and a minimal
self-contained example agent that dumps map data. The example is well
documented and hopefully provides a good starting point into programming
cls_bpf and act_bpf.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
- Pull in the uapi mpls.h
- Update rtnetlink.h to include the mpls rtnetlink notification multicast group.
- Define AF_MPLS in utils.h if it is not defined from elsewhere
as is done with AF_DECnet
The address syntax for multiple mpls labels is a complete invention.
When I looked there seemed to be no wide spread convention for talking
about an mpls label stack in text for. Sometimes people did:
"{ Label1, Label2, Label3 }", sometimes people would do:
"[ label3, label2, label1 ]", and most of the time label
stacks were not explicitly shown at all.
The syntax I wound up using, so it would not have spaces and so it
would visually distinct from other kinds of addresses is.
label1/label2/label3 Where label1 is the label at the top of the label
stack and label3 is the label at the bottom on the label stack.
When there is a single label this matches what seems to be convention
with other tools. Just print out the numeric value of the mpls label.
The netlink protocol for labels uses the on the wire format for a
label stack. The ttl and traffic class are expected to be 0. Using
the on the wire format is common and what happens with other address
types. BGP when passing label stacks also uses this technique with the
exception that the ttl byte is not included making each label in a BGP
label stack 3 bytes instead of 4.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add support for the RTA_VIA attribute that specifies an address family
as well as an address for the next hop gateway.
To make it easy to pass this reorder inet_prefix so that it's tail
is a proper RTA_VIA attribute.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Add the functions family_name and read_family to convert an address
family to a string and to convernt a string to an address family.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
For some address families (like AF_PACKET) it is helpful to have the
length when prenting the address.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
This work adds the tc frontend for kernel commit e2e9b6541dd4 ("cls_bpf:
add initial eBPF support for programmable classifiers").
A C-like classifier program (f.e. see e2e9b6541dd4) is being compiled via
LLVM's eBPF backend into an ELF file, that is then being passed to tc. tc
then loads, if any, eBPF maps and eBPF opcodes (with fixed-up eBPF map file
descriptors) out of its dedicated sections, and via bpf(2) into the kernel
and then the resulting fd via netlink down to cls_bpf. cls_bpf allows for
annotations, currently, I've used the file name for that, so that the user
can easily identify his filter when dumping configurations back.
Example usage:
clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
tc filter add dev em1 parent 1: bpf run object-file cls.o classid x:y
tc filter show dev em1 [...]
filter parent 1: protocol all pref 49152 bpf handle 0x1 flowid x:y cls.o
I placed the parser bits derived from Alexei's kernel sample, into tc_bpf.c
as my next step is to also add the same support for BPF action, so we can
have a fully fledged eBPF classifier and action in tc.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
This change allows to exec some cmd on each
named netns (except default) by specifying '-all' option:
# ip -all netns exec ip link
Each command executes synchronously.
Exit status is not considered, so there might be a case
that some CMD can fail on some netns but success on the other.
EXAMPLES:
1) Show link info on all netns:
$ ip -all netns exec ip link
netns: test_net
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 500
link/ether 1a:19:6f:25:eb:85 brd ff:ff:ff:ff:ff:ff
netns: home0
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 500
link/ether ea:1a:59:40:d3:29 brd ff:ff:ff:ff:ff:ff
netns: lan0
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 500
link/ether ce:49:d5:46:81:ea brd ff:ff:ff:ff:ff:ff
2) Set UP tap0 device for the all netns:
$ ip -all netns exec ip link set dev tap0 up
netns: test_net
netns: home0
netns: lan0
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
Added another timestamp format to look like more logging info:
[2014-12-22T22:36:50.489 ] 2: enp0s25: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default
link/ether 3c:97:0e:a3:86:2e brd ff:ff:ff:ff:ff:ff
Signed-off-by: Vadim Kochan <vadim4j@gmail.com>
The RTM_NEWLINK message accepts ifi_index non-zero value and lets
creation of links with given index (if it's free, or course). This
functionality is available since linux-v3.5.
This patch makes this API available via ip tool.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The kernel already supports it, so add the support
to iproute2 as well.
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
The get_jiffies() function retrieves rtt-type values in units of
milliseconds. This patch updates the function name accordingly,
following the pattern given by dst_metric() <=> dst_metric_rtt().
Add the group keyword to ip link set, which has the following meaning:
If both a group and a device name are pressent, we change the device's
group to the specified one. If only a group is present, then the
operation specified by the rest of the command should apply on an entire
group, not a single device.
So, to set eth0 to the default group, one would use
ip link set dev eth0 group default
Conversely, to set all the devices in the default group down, use
ip link set group default down
Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
The default remains at 10 for backwards compatibility.
For instance:
# ip addr flush dev eth2
*** Flush remains incomplete after 10 rounds. ***
# ip -l 20 addr flush dev eth2
*** Flush remains incomplete after 20 rounds. ***
# ip -loops 0 addr flush dev eth2
#
This is useful for getting rid of large numbers of IP
addresses in scripts.
Signed-off-by: Ben Greear <greearb@candelatech.com>
“ip maddr show” on an infiniband address causes a stack corruption
because the length of the address for Infiniband (20 bytes, as
described in kernel doc Documentation/infiniband/ipoib.txt) does not
fit on the 16 bytes of the field in which it gets stored.
The proposed patch increases the size of the hardware address from 4
__u32 to 8 and also adds a check to avoid overriding the available
size while parsing the hardware address.
This bug affects current upstream code AFAICT.
Hope this helps,
Cheers,
Olivier.
“ip maddr show ib0” causes a stack corruption because the length of the address
for Infiniband (20 see kernel doc Documentation/infiniband/ipoib.txt) does not
fit on the 16 bytes of the field in which it gets stored.
The proposed patch increases the size of the hardware address from 4 u32 to 8
and adds a check to avoid overriding the available size while parsing the
hardware address.
This routine parses CLI attributes, describing generic link
parameters such as name, address, etc.
This is mostly copy-pasted from iplink_modify().
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
The problem was that length of allocation changed but caller not told.
Anyway, the patch fixes a problem resulting in a double free
that occurs when using batch files that contains a special combination
of broken up lines and comments as reported in:
http://bugs.debian.org/398912
Thanks to Michal Pokrywka <mpokrywka@hoga.pl> for testcase and information
on which conditions problem could be reproduced under.
Signed-off-by: Andreas Henriksson <andreas@fatal.se>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Enable users of ip to specify the times for rtt, rttvar and rto_min
in human-friendly terms a la "tc" while maintaining backwards
compatability with the previous "raw" mechanism. Builds upon
David Miller's uncommited patch to set rto_min.
Signed-off-by: Rick Jones <rick.jones2@hp.com>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
cheers,
jamal
[GENERAL] nl_mgrp to crap if base multicast groups exceeded
The old scheme of bitmasks works only for the first 32 groups.
Above that the setsockopt scheme must be used.
Signed-off-by: J Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
[utils] make muticast group to bitmask conversion generic
Signed-off-by: J Hadi Salim <hadi@cyberus.ca>
Signed-off-by: Stephen Hemminger <shemminger@osdl.org>