Dongpo Li says:
====================
Add Hisilicon MDIO bus driver and FEMAC driver
This patch set adds a Hisilicon MDIO bus driver and
a Fast Ethernet MAC(FEMAC) driver.
We also introduce a general interface, "of_phy_get_and_connect",
for connecting to the PHY, so users no longer have to look up
"phy-mode" and "phy-handle" themselves.
Changes in v1:
- Pass private data structure instead of struct mii_bus
in MDIO read and write operation.
- Return the error that devm_clk_get() gives during MDIO probe.
- Leave the clock unprepared and disabled on error during MDIO probe.
- Abstract a general interface "of_phy_get_and_connect" for PHY connect.
- Remove the "_reset" suffixes in "reset-names" property.
- Enable tx per-packet interrupt when tx fifo full.
- Remove pointless compatible and add SoC specific compatible.
- Declare only one clock in MAC dts documentation.
- Add standard unit suffixes for "phy-reset-delays".
- Use a smaller NAPI poll weight 16 for our Fast Ethernet MAC.
- Use phy_ethtool_{get|set}_link_ksettings for ethtool ops.
- Use phydev from struct net_device in MAC driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the Hisilicon Fast Ethernet MAC(FEMAC) driver.
The FEMAC supports a maximum speed of 100Mbps and is used in many
Hisilicon SoCs.
Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
Reviewed-by: Jiancheng Xue <xuejiancheng@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a general interface, "of_phy_get_and_connect",
for connecting to the PHY, so users no longer have to look up
"phy-mode" and "phy-handle" themselves.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
Reviewed-by: Jiancheng Xue <xuejiancheng@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a separate driver for the MDIO interface of the
Hisilicon Fast Ethernet MAC.
Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
Reviewed-by: Jiancheng Xue <xuejiancheng@hisilicon.com>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
TI_CPSW_PHY_SEL depended on TI_CPSW and was selected by the latter. So
there is no reason to have this symbol visible.
A further optimisation would be to put the code for both symbols into a
single module, which would make it unnecessary to export at least
cpsw_phy_sel() and would simplify the module load process.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Labeled statements were named err_xxx, which is not easy to
understand. Name them free_xxx instead.
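As an illustration only (a userspace sketch, not the driver code; demo_ctx
and its fields are hypothetical), labels named after what they free make
the unwind order readable at a glance:

```c
#include <stdlib.h>

/* Hypothetical two-buffer context, for illustration only. */
struct demo_ctx {
	char *rx_buf;
	char *tx_buf;
};

/*
 * Each label is named after the resource it releases, so the
 * unwind order can be read directly from the label names.
 */
static int demo_ctx_init(struct demo_ctx *ctx, size_t rx_len, size_t tx_len)
{
	ctx->rx_buf = malloc(rx_len);
	if (!ctx->rx_buf)
		goto out;

	ctx->tx_buf = malloc(tx_len);
	if (!ctx->tx_buf)
		goto free_rx_buf;

	return 0;

free_rx_buf:
	free(ctx->rx_buf);
	ctx->rx_buf = NULL;
out:
	return -1;
}
```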
Signed-off-by: Zhao Qiang <qiang.zhao@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for hardware offloading of ipmr/ip6mr we need an
interface that allows checking (and later updating) the age of entries.
Relying on stats alone can show activity but not the actual age of an
entry. Furthermore, when there are tens of thousands of entries, many
hardware implementations only support "hit" bits which are cleared on
read to denote that the entry was active and shouldn't be aged out.
These can be naturally translated into an age timestamp and will be
compatible with the software forwarding age. Using a lastuse entry doesn't
affect performance because the members in that cache line are written to
along with the age.
Since all new users are encouraged to use ipmr via netlink, this is
exported via the RTA_EXPIRES attribute.
Also do a minor local variable declaration style adjustment - arrange them
longest to shortest.
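A rough sketch of how a clear-on-read "hit" bit could be folded into a
lastuse timestamp. This is a userspace model under assumptions: the
demo_mfc_entry structure, the hw_hit flag and the tick units are all
hypothetical and are not taken from the ipmr code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache entry; not the kernel's mfc_cache. */
struct demo_mfc_entry {
	uint64_t lastuse;	/* time of last known activity, in ticks */
	bool hw_hit;		/* models a clear-on-read hardware hit bit */
};

/* Read and clear the hit bit; refresh lastuse if the entry was active. */
static void demo_sync_hit(struct demo_mfc_entry *e, uint64_t now)
{
	if (e->hw_hit) {
		e->hw_hit = false;
		e->lastuse = now;
	}
}

/* Age of the entry, in the same ticks as lastuse. */
static uint64_t demo_age(const struct demo_mfc_entry *e, uint64_t now)
{
	return now - e->lastuse;
}
```

An entry that was hit since the last poll reports an age of zero right
after the sync; an idle entry keeps aging.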
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Shrijeet Mukherjee <shm@cumulusnetworks.com>
CC: Satish Ashok <sashok@cumulusnetworks.com>
CC: Donald Sharp <sharpd@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
macsec can't cope with mtu-sized frames which need vlan tag insertion,
and vlan devices set their default mtu equal to that of the underlying
dev. By default, vlan over macsec devices therefore use an invalid mtu,
dropping all the large packets.
This patch adds a netif helper to check if an upper vlan device
needs mtu reduction. The helper is used during vlan device
initialization to set a valid default and during mtu updates to
forbid invalid, too big, mtu values.
The helper currently only checks if the lower dev is a macsec device.
If we get more users, we need to update only the helper (possibly
reserving an additional IFF bit).
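A minimal sketch of the check described above, as a userspace model. The
demo_* names and the demo_dev structure are hypothetical stand-ins and
not the helper added by this patch; only the 4-byte 802.1Q tag size is a
known constant.

```c
#include <stdbool.h>

#define DEMO_VLAN_HLEN 4	/* one 802.1Q tag, in bytes */

/* Hypothetical stand-in for the lower device; not struct net_device. */
struct demo_dev {
	unsigned int mtu;
	bool is_macsec;
};

/* True when a vlan stacked on @lower must leave room for the tag. */
static bool demo_reduces_vlan_mtu(const struct demo_dev *lower)
{
	return lower->is_macsec;
}

/* Largest mtu a vlan device on top of @lower may use. */
static unsigned int demo_vlan_max_mtu(const struct demo_dev *lower)
{
	unsigned int max = lower->mtu;

	if (demo_reduces_vlan_mtu(lower) && max >= DEMO_VLAN_HLEN)
		max -= DEMO_VLAN_HLEN;
	return max;
}

/* Returns 0 if @new_mtu is acceptable, -1 if it is too big. */
static int demo_vlan_check_mtu(const struct demo_dev *lower,
			       unsigned int new_mtu)
{
	return new_mtu > demo_vlan_max_mtu(lower) ? -1 : 0;
}
```

The same bound serves both as the default at init time and as the upper
limit when the mtu is changed later.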
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some devices of the same type all export the same, random MAC address. This
behavior has been seen on the ZTE MF910, MF823 and MF831, and there are
probably more devices out there. Fix this by generating a valid random MAC
address if we read a random MAC from the device.
Also, change the memcpy() to ether_addr_copy(), as pointed out by
checkpatch.
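For reference, what makes a random address "valid" can be sketched in
userspace as follows. This is an illustration, not the driver change:
demo_random_ether_addr is a hypothetical name and rand() stands in for a
proper random source. The two bit operations are the standard ones for a
unicast, locally administered address.

```c
#include <stdint.h>
#include <stdlib.h>

#define DEMO_ETH_ALEN 6

/* Fill @addr with a random unicast, locally administered MAC address. */
static void demo_random_ether_addr(uint8_t addr[DEMO_ETH_ALEN])
{
	for (int i = 0; i < DEMO_ETH_ALEN; i++)
		addr[i] = (uint8_t)(rand() & 0xff);

	addr[0] &= 0xfe;	/* clear the multicast bit */
	addr[0] |= 0x02;	/* set the locally administered bit */
}
```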
Suggested-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Aleksandrov says:
====================
net: bridge: simplify receive path and consolidate forwarding paths
This set tries to simplify the receive and forwarding paths. Patch 01 is
a trivial style adjustment, patch 02 removes one conditional from the
unicast fast path, patch 03 removes another conditional and more importantly
removes the skb0/skb2 ambiguity about locally receiving the skb and
switches to a boolean called "local_rcv".
Patch 04 is the most important change which consolidates the forwarding
paths for locally originated and forwarded packets into __br_forward. This
allows us to remove the function pointers giving a minor performance boost,
more importantly it makes it much easier to reason about the forwarding
path and reduces the code duplication that was needed when making changes.
Also it allows the receive path to fully set up the environment prior to
calling any forwarding functions (i.e. to properly set unicast, local_rcv
and search for unicast/mcast dst).
Functionally everything should stay the same after this set.
I've done basic tests with unicast/multicast/broadcast Tx/Rx. Please
review carefully.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch we had two flavors of most forwarding functions -
_forward and _deliver, the difference being that the latter are used
when the packets are locally originated. Instead of all this function
pointer passing and code duplication, we can just pass a boolean noting
that the packet was locally originated and use that to perform the
necessary checks in __br_forward. This gives a minor performance
improvement but more importantly consolidates the forwarding paths.
Also add a kernel doc comment to explain the exported br_forward()'s
arguments.
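The shape of the consolidation can be sketched like this. It is a toy
model, not the bridge code: the demo_* names, the counters and the idea
that the flag merely selects a hook are assumptions made for
illustration.

```c
#include <stdbool.h>

/* Hypothetical hooks a packet can be handed to. */
enum demo_hook { DEMO_HOOK_LOCAL_OUT, DEMO_HOOK_FORWARD };

/* Hypothetical per-port counters; not struct net_bridge_port. */
struct demo_port {
	unsigned long local_out;
	unsigned long forwarded;
};

/*
 * One forwarding function for both cases. The caller states whether
 * the packet was locally originated instead of passing a function
 * pointer to a _deliver or _forward variant.
 */
static enum demo_hook demo_forward(struct demo_port *to, bool local_orig)
{
	if (local_orig) {
		to->local_out++;
		return DEMO_HOOK_LOCAL_OUT;
	}

	to->forwarded++;
	return DEMO_HOOK_FORWARD;
}
```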
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently if the packet is going to be received locally we set the skb0
(sometimes called skb2) variable to the original skb. This can get
confusing, and we can also avoid one conditional on the fast path by
simply using a boolean and passing it around. Thanks to Roopa for the
name suggestion.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes one conditional from the unicast path by using the fact
that skb is NULL only when the packet is multicast or is local.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Trivial style changes in br_handle_frame_finish.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds kernel-doc style descriptions for 6 functions and
fixes 1 typo.
Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
One regression in the Device Tree handling for OMAP NAND handling of the ELM
node. TI migrated to using the property name "ti,elm-id", but forgot to keep
compatibility with the old "elm_id" property.
Also, might as well send out this MAINTAINERS fixup now.
Merge tag 'for-linus-20160715' of git://git.infradead.org/linux-mtd
Pull MTD fix from Brian Norris:
"Late MTD fix for v4.7:
One regression in the Device Tree handling for OMAP NAND handling of
the ELM node. TI migrated to using the property name "ti,elm-id", but
forgot to keep compatibility with the old "elm_id" property.
Also, might as well send out this MAINTAINERS fixup now"
* tag 'for-linus-20160715' of git://git.infradead.org/linux-mtd:
mtd: nand: omap2: Add check for old elm binding
MAINTAINERS: Add file patterns for mtd device tree bindings
There are two generic functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.
There was a check on CAP_NET_ADMIN in cpmac_set_settings, but this
check is already done in dev_ethtool, so no need to repeat it before
calling the generic function.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The private structure contains a pointer to phydev, but the structure
net_device already contains such a pointer. So we can remove the pointer
phy in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two generic functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.
There was a check on CAP_NET_ADMIN in au1000_set_settings, but this
check is already done in dev_ethtool, so no need to repeat it before
calling the generic function.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The private structure contains a pointer to phydev, but the structure
net_device already contains such a pointer. So we can remove the pointer
phydev in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two generic functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The private structure contains a pointer to phydev, but the structure
net_device already contains such a pointer. So we can remove the pointer
phy in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two generic functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
The private structure contains a pointer to phydev, but the structure
net_device already contains such a pointer. So we can remove the pointer
phy in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two generic functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The private structure contains a pointer to phydev, but the structure
net_device already contains such a pointer. So we can remove the pointer
phy in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two generic functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The private structure contains a pointer to phydev, but the structure
net_device already contains such a pointer. So we can remove the pointer
phy in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two generic functions phy_ethtool_{get|set}_link_ksettings,
so we can use them instead of defining the same code in the driver.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The private structure contains a pointer to phydev, but the structure
net_device already contains such a pointer. So we can remove the pointer
phy in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nothing is decrementing the index "i" while we are cleaning up the
fragments we could not successfully transmit.
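The class of bug can be pictured with a small userspace model. The
demo_ring structure and slot count are hypothetical; only the loop shape
matters. Without the decrement in the loop condition, "i" never changes
and the unwind loop never terminates.

```c
#define DEMO_RING_SLOTS 8

/* Hypothetical TX ring: mapped[n] is nonzero while slot n holds a mapping. */
struct demo_ring {
	int mapped[DEMO_RING_SLOTS];
	unsigned int end;	/* first slot used by the current packet */
};

/* Undo the first @i mappings made from ring->end onwards. */
static void demo_unwind(struct demo_ring *ring, unsigned int i)
{
	/* The post-decrement is what makes the loop terminate. */
	while (i-- > 0) {
		unsigned int index = (ring->end + i) % DEMO_RING_SLOTS;

		ring->mapped[index] = 0;
	}
}
```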
Fixes: 9cde94506e ("bgmac: implement scatter/gather support")
Reported-by: coverity (CID 1352048)
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The recent change to the napi:napi_poll tracepoint changed the order of
the parameters that perf scripts see; the printk was correct. The
problem was that the new parameters (work and budget) were pushed
in front of dev_name.
The new parameters need to be appended instead to remain backward
compatible.
Fixes: 1db19db7f5 ("net: tracepoint napi:napi_poll add work and budget")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull input fixes from Dmitry Torokhov:
"A few last-minute updates for the input subsystem"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: ts4800-ts - add missing of_node_put after calling of_parse_phandle
Input: synaptics-rmi4 - use of_get_child_by_name() to fix refcount
Revert "Input: wacom_w8001 - drop use of ABS_MT_TOOL_TYPE"
Input: xpad - validate USB endpoint count during probe
Input: add SW_PEN_INSERTED define
Jiri Pirko says:
====================
mlxsw: Couple of fixes
Couple of fixes for mlxsw driver from Ido.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Packets entering the switch are mapped to a Switch Priority (SP)
according to their PCP value (untagged frames are mapped to SP 0).
The packets are classified to a priority group (PG) buffer in the port's
headroom according to their SP.
The switch maintains another mapping (SP to IEEE priority), which is
used to generate PFC frames for lossless PGs. This mapping is
initialized to IEEE = SP % 8.
Therefore, when mapping SP 'x' to PG 'y' we create a situation in which
an IEEE priority is mapped to two different PGs:
IEEE 'x' ---> SP 'x' ---> PG 'y'
IEEE 'x' ---> SP 'x + 8' ---> PG '0' (default)
Which is invalid, as a flow can use only one PG buffer.
Fix this by mapping both SP 'x' and 'x + 8' to the same PG buffer.
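A sketch of the resulting mapping, as plain C over arrays. The demo_*
names are hypothetical, and the table sizes (8 IEEE priorities, 16 switch
priorities) are inferred from the "SP % 8" and "x + 8" wording above
rather than taken from the driver.

```c
#include <stdint.h>

#define DEMO_IEEE_PRIO_MAX 8
#define DEMO_SP_MAX 16

/*
 * @prio_pg: requested PG buffer for each IEEE priority.
 * @sp_to_pg: resulting PG buffer for each switch priority.
 *
 * SP 'x' and SP 'x + 8' both map back to IEEE priority 'x', so they
 * must land in the same PG buffer.
 */
static void demo_map_sp_to_pg(const uint8_t prio_pg[DEMO_IEEE_PRIO_MAX],
			      uint8_t sp_to_pg[DEMO_SP_MAX])
{
	for (int sp = 0; sp < DEMO_IEEE_PRIO_MAX; sp++) {
		sp_to_pg[sp] = prio_pg[sp];
		sp_to_pg[sp + 8] = prio_pg[sp];
	}
}
```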
Fixes: 8e8dfe9fdf ("mlxsw: spectrum: Add IEEE 802.1Qaz ETS support")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The number of supported traffic classes that can have ETS and PFC
simultaneously enabled is not subject to user configuration, so make
sure we always initialize them to the correct values following a set
operation.
Fixes: 8e8dfe9fdf ("mlxsw: spectrum: Add IEEE 802.1Qaz ETS support")
Fixes: d81a6bdb87 ("mlxsw: spectrum: Add IEEE 802.1Qbb PFC support")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can't have PAUSE frames and PFC both enabled on the same port, but
the fact that ieee_setpfc() was called doesn't necessarily mean PFC is
enabled.
Only emit errors when PAUSE frames and PFC are enabled simultaneously.
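The intended condition, sketched outside the driver. demo_check_pfc is a
hypothetical name; pfc_en is modelled as a bitmask with one bit per
priority that has PFC enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns 0 if the configuration is acceptable, -1 if PAUSE frames
 * and PFC would both be enabled. A set operation that leaves every
 * priority disabled (pfc_en == 0) is not a conflict.
 */
static int demo_check_pfc(bool pause_enabled, uint8_t pfc_en)
{
	if (pause_enabled && pfc_en)
		return -1;
	return 0;
}
```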
Fixes: d81a6bdb87 ("mlxsw: spectrum: Add IEEE 802.1Qbb PFC support")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The device supports link autonegotiation, so let the user know about it
by indicating support via ethtool ops.
Fixes: 56ade8fe3f ("mlxsw: spectrum: Add initial support for Spectrum ASIC")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting a new speed we need to disable and enable the port for the
changes to take effect. We currently only do that if the operational
state of the port is up. However, setting a new speed following link
training failure will require us to explicitly set the port down and then
up.
Instead, disable and enable the port based on its administrative state.
Fixes: 56ade8fe3f ("mlxsw: spectrum: Add initial support for Spectrum ASIC")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch switches to using an skb array instead of sk_receive_queue to
avoid spinlock contention. Tests show about a 21% improvement in
guest rx pps:
Before: 1472731 pkts/s
After: 1786289 pkts/s
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We select the rxq by calculating a hash, which is not necessary
if we only have one rx queue. So this patch skips the calculation and
just returns queue 0. Tests show a 22% improvement in guest rx pps.
Before: 1201504 pkts/s
After: 1472731 pkts/s
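The shortcut can be modelled in userspace as below. The demo_* names are
hypothetical and FNV-1a is only a stand-in for the real flow hash; the
call counter exists just to show that no hash is computed in the
single-queue case.

```c
#include <stddef.h>
#include <stdint.h>

static unsigned long demo_hash_calls;

/* FNV-1a over the packet bytes; a placeholder for the real flow hash. */
static uint32_t demo_flow_hash(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;

	demo_hash_calls++;
	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/* Pick an rx queue; with at most one queue the hash is never computed. */
static unsigned int demo_select_queue(const uint8_t *data, size_t len,
				      unsigned int numqueues)
{
	if (numqueues <= 1)
		return 0;

	return demo_flow_hash(data, len) % numqueues;
}
```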
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull workqueue fix from Tejun Heo:
"The optimization for setting unbound worker affinity masks collided
with recent scheduler changes triggering warning messages.
This late pull request fixes the bug by removing the optimization"
* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix setting affinity of unbound worker threads
Without this check, the following XFS_I invocations would return bad
pointers when used on non-XFS inodes (perhaps pointers into preceding
allocator chunks).
This could be used by an attacker to trick xfs_swap_extents into
performing locking operations on attacker-chosen structures in kernel
memory, potentially leading to code execution in the kernel. (I have
not investigated how likely this is to be usable for an attack in
practice.)
Signed-off-by: Jann Horn <jann@thejh.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2016-07-14
This series contains fixes to i40e and ixgbe.
Alex fixes issues found in i40e_rx_checksum() which was broken, where the
checksum was being returned valid when it was not.
Kiran fixes a bug, found when a cable is abruptly removed, which
caused a panic. The fix sets the VSI broadcast promiscuous mode during
the VSI add sequence and prevents adding a MAC filter if the specified
MAC address is broadcast.
Paolo Abeni fixes a bug by returning the actual work done, capped to
weight - 1, since the core doesn't allow returning the full budget when
the driver modifies the NAPI status.
Guilherme Piccoli fixes an issue where the q_vector initialization
routine sets the affinity _mask of a q_vector based on v_idx value.
This means a loop iterates on v_idx, which is an incremental value, and
the cpumask is created based on this value. This is a problem in
systems with multiple logical CPUs per core (like in SMT scenarios).
Changed the way q_vector's affinity_mask is created to resolve the issue.
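For the NAPI fix mentioned above, the return-value rule can be sketched
as follows. demo_poll_result is a hypothetical helper, not driver code,
and it assumes budget is greater than zero.

```c
#include <stdbool.h>

/*
 * Value a poll routine would return.
 *
 * If cleaning did not finish, the full budget is returned and polling
 * stays scheduled. If cleaning finished and the driver completes NAPI
 * itself, the actual work done is returned, capped to budget - 1 so
 * that the full budget is never reported in that case.
 */
static int demo_poll_result(int work_done, int budget, bool clean_complete)
{
	if (!clean_complete)
		return budget;

	return work_done < budget ? work_done : budget - 1;
}
```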
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ethtool -i provides a driver version that is hard coded.
Export the same value via "modinfo".
Signed-off-by: Grant Grundler <grundler@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
BPF event output helper improvements
This set adds improvements to the BPF event output helper to
support non-linear data sampling, here specifically, for skb
context. For details please see individual patches. The set
is based against net-next tree.
v1 -> v2:
- Integrated and adapted Peter's diff into patch 1, updated
the remaining ones accordingly. Thanks Peter!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This work addresses a couple of issues bpf_skb_event_output()
helper currently has: i) We need two copies instead of just a
single one for the skb data when it should be part of a sample.
The data can be non-linear and thus needs to be extracted via
bpf_skb_load_bytes() helper first, and then copied once again
into the ring buffer slot. ii) Since bpf_skb_load_bytes()
currently needs to be used first, the helper needs to see a
constant size on the passed stack buffer to make sure BPF
verifier can do sanity checks on it during verification time.
Thus, just passing skb->len (or any other non-constant value)
wouldn't work, but changing bpf_skb_load_bytes() is also not
the proper solution, since the two copies are generally still
needed. iii) bpf_skb_load_bytes() is just for rather small
buffers like headers, since they need to sit on the limited
BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
this work improves the bpf_skb_event_output() helper to address
all 3 at once.
We can make use of the passed in skb context that we have in
the helper anyway, and use some of the reserved flag bits as
a length argument. The helper will use the new __output_custom()
facility from perf side with bpf_skb_copy() as callback helper
to walk and extract the data. It will pass the data for setup
to bpf_event_output(), which generates and pushes the raw record
with an additional frag part. The linear data used in the first
frag of the record serves as programmatically defined meta data
passed along with the appended sample.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split the bpf_perf_event_output() helper as a preparation into
two parts. The new bpf_perf_event_output() will prepare the raw
record itself and test for unknown flags from BPF trace context,
where the __bpf_perf_event_output() does the core work. The
latter will be reused later on from bpf_event_output() directly.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for non-linear data on raw records. It
extends raw records to have one or multiple fragments that will
be written linearly into the ring slot, where each fragment can
optionally have a custom callback handler to walk and extract
complex, possibly non-linear data.
If a callback handler is provided for a fragment, then the new
__output_custom() will be used instead of __output_copy() for
the perf_output_sample() part. perf_prepare_sample() does all
the size calculation only once, so perf_output_sample() doesn't
need to redo the same work anymore, meaning real_size and padding
will be cached in the raw record. The raw record becomes 32 bytes
in size without holes; to not increase it further and to avoid
doing unnecessary recalculations in fast-path, we can reuse the
next pointer of the last fragment. The idea here is borrowed from
ZERO_OR_NULL_PTR(), which should keep the perf_output_sample()
path for PERF_SAMPLE_RAW minimal.
This facility is needed for BPF's event output helper as a first
user that will, in a follow-up, add an additional perf_raw_frag
to its perf_raw_record in order to be able to more efficiently
dump skb context after a linear head meta data related to it.
skbs can be non-linear and thus need a custom output function to
dump buffers. Currently, the skb data needs to be copied twice;
with the help of __output_custom() this work only needs to be
done once. Future users could be things like XDP/BPF programs
that work on different context though and would thus also have
a different callback function.
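To make the fragment idea concrete, here is a simplified userspace
model. All demo_* names, the callback signature and the two-segment
source are hypothetical; the real perf structures and callbacks differ.
It only shows fragments being written linearly into one buffer, with an
optional per-fragment callback for non-linear data so that the data is
copied once.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical copy callback: gather @len bytes from @src into @dst. */
typedef void (*demo_copy_fn)(void *dst, const void *src, size_t len);

struct demo_frag {
	const struct demo_frag *next;
	demo_copy_fn copy;	/* NULL means a plain memcpy() */
	const void *data;
	size_t size;
};

/* Hypothetical non-linear source made of two separate segments. */
struct demo_two_seg {
	const char *seg[2];
	size_t len[2];
};

/* Example callback: walk both segments and copy up to @len bytes. */
static void demo_copy_two_seg(void *dst, const void *src, size_t len)
{
	const struct demo_two_seg *s = src;
	char *out = dst;

	for (int i = 0; i < 2 && len > 0; i++) {
		size_t n = s->len[i] < len ? s->len[i] : len;

		memcpy(out, s->seg[i], n);
		out += n;
		len -= n;
	}
}

/* Write all fragments back to back into @dst; returns bytes written. */
static size_t demo_output_frags(void *dst, size_t cap,
				const struct demo_frag *frag)
{
	size_t off = 0;

	for (; frag; frag = frag->next) {
		if (frag->size > cap - off)
			break;	/* does not fit, stop here */

		if (frag->copy)
			frag->copy((char *)dst + off, frag->data, frag->size);
		else
			memcpy((char *)dst + off, frag->data, frag->size);
		off += frag->size;
	}
	return off;
}
```

A linear meta data fragment followed by a callback-driven fragment ends
up as one contiguous record in the destination buffer.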
The few users of raw records are adapted to initialize their frag
data from the raw record itself, no change in behavior for them.
The code is based upon a PoC diff provided by Peter Zijlstra [1].
[1] http://thread.gmane.org/gmane.linux.network/421294
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>