Add a new namespace (MLX5_FLOW_NAMESPACE_OFFLOADS) to be populated
with flow steering rules that deal with rules that have have to
be executed before the EN NIC steering rules are matched.
The namespace is located after the bypass name-space and before the
kernel name-space. Therefore, it precedes the HW processing done for
rules set for the kernel NIC name-space.
Under SRIOV, it would allow us to match on e-switch missed packet
and forward them to the relevant VF representor TIR.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Amir Vadai <amir@vadai.me>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper function to get a cgroup2 from a fd. It will be
stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will
be introduced in the later patch.
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since bpf_prog_get() and program type check is used in a couple of places,
refactor this into a small helper function that we can make use of. Since
the non RO prog->aux part is not used in performance critical paths and a
program destruction via RCU is rather very unlikley when doing the put, we
shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
check, but actually not taking the ref at all (due to being in fdget() /
fdput() section of the bpf fd) is even cleaner and makes the diff smaller
as well, so just go for that. Callsites are changed to make use of the new
helper where possible.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jann Horn reported following analysis that could potentially result
in a very hard to trigger (if not impossible) UAF race, to quote his
event timeline:
- Set up a process with threads T1, T2 and T3
- Let T1 set up a socket filter F1 that invokes another filter F2
through a BPF map [tail call]
- Let T1 trigger the socket filter via a unix domain socket write,
don't wait for completion
- Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
- Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
- Let T3 close the file descriptor for F2, dropping the reference
count of F2 to 2
- At this point, T1 should have looked up F2 from the map, but not
finished executing it
- Let T3 remove F2 from the BPF map, dropping the reference count of
F2 to 1
- Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
via schedule_work()
- At this point, the BPF program could be freed
- BPF execution is still running in a freed BPF program
While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
event fd we're doing the syscall on doesn't disappear from underneath us
for whole syscall time, it may not be the case for the bpf fd used as
an argument only after we did the put. It needs to be a valid fd pointing
to a BPF program at the time of the call to make the bpf_prog_get() and
while T2 gets preempted, F2 must have dropped reference to 1 on the other
CPU. The fput() from the close() in T3 should also add additionally delay
to the reference drop via exit_task_work() when bpf_prog_release() gets
called as well as scheduling bpf_prog_free_deferred().
That said, it makes nevertheless sense to move the BPF prog destruction
generally after RCU grace period to guarantee that such scenario above,
but also others as recently fixed in ceb5607035 ("bpf, perf: delay release
of BPF prog after grace period") with regards to tail calls won't happen.
Integrating bpf_prog_free_deferred() directly into the RCU callback is
not allowed since the invocation might happen from either softirq or
process context, so we're not permitted to block. Reviewing all bpf_prog_put()
invocations from eBPF side (note, cBPF -> eBPF progs don't use this for
their destruction) with call_rcu() look good to me.
Since we don't know whether at the time of attaching the program, we're
already part of a tail call map, we need to use RCU variant. However, due
to this, there won't be severely more stress on the RCU callback queue:
situations with above bpf_prog_get() and bpf_prog_put() combo in practice
normally won't lead to releases, but even if they would, enough effort/
cycles have to be put into loading a BPF program into the kernel already.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We used to queue tx packets in sk_receive_queue, this is less
efficient since it requires spinlocks to synchronize between producer
and consumer.
This patch tries to address this by:
- switch from sk_receive_queue to a skb_array, and resize it when
tx_queue_len was changed.
- introduce a new proto_ops peek_len which was used for peeking the
skb length.
- implement a tun version of peek_len for vhost_net to use and convert
vhost_net to use peek_len if possible.
Pktgen test shows about 15.3% improvement on guest receiving pps for small
buffers:
Before: ~1300000pps
After : ~1500000pps
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this
will be triggered when tx_queue_len. It could be used by net device
who want to do some processing at that time. An example is tun who may
want to resize tx array when tx_queue_len is changed.
Cc: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sometimes, we need support resizing multiple queues at once. This is
because it was not easy to recover to recover from a partial failure
of multiple queues resizing.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sometimes, we need zero length ring. But current code will crash since
we don't do any check before accessing the ring. This patch fixes this.
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several cases of overlapping changes, except the packet scheduler
conflicts which deal with the addition of the free list parameter
to qdisc_enqueue().
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull audit fixes from Paul Moore:
"Two small patches to fix audit problems in 4.7-rcX: the first fixes a
potential kref leak, the second removes some header file noise.
The first is an important bug fix that really should go in before 4.7
is released, the second is not critical, but falls into the very-nice-
to-have category so I'm including in the pull request.
Both patches are straightforward, self-contained, and pass our
testsuite without problem"
* 'stable-4.7' of git://git.infradead.org/users/pcmoore/audit:
audit: move audit_get_tty to reduce scope and kabi changes
audit: move calcs after alloc and check when logging set loginuid
Pull networking fixes from David Miller:
"I've been traveling so this accumulates more than week or so of bug
fixing. It perhaps looks a little worse than it really is.
1) Fix deadlock in ath10k driver, from Ben Greear.
2) Increase scan timeout in iwlwifi, from Luca Coelho.
3) Unbreak STP by properly reinjecting STP packets back into the
stack. Regression fix from Ido Schimmel.
4) Mediatek driver fixes (missing malloc failure checks, leaking of
scratch memory, wrong indexing when mapping TX buffers, etc.) from
John Crispin.
5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
Sowa.
6) Fix hashing of flows in UDP in the ruseport case, from Xuemin Su.
7) Fix netlink notifications in ovs for tunnels, delete link messages
are never emitted because of how the device registry state is
handled. From Nicolas Dichtel.
8) Conntrack module leaks kmemcache on unload, from Florian Westphal.
9) Prevent endless jump loops in nft rules, from Liping Zhang and
Pablo Neira Ayuso.
10) Not early enough spinlock initialization in mlx4, from Eric
Dumazet.
11) Bind refcount leak in act_ipt, from Cong WANG.
12) Missing RCU locking in HTB scheduler, from Florian Westphal.
13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
barrier, using heap for SG and IV, and erroneous use of async flag
when allocating AEAD conext.)
14) RCU handling fix in TIPC, from Ying Xue.
15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
SIT driver, from Simon Horman.
16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.
17) Fix potential deadlock in team enslave, from Ido Schimmel.
18) Memory leak in KCM procfs handling, from Jiri Slaby.
19) ESN generation fix in ipv4 ESP, from Herbert Xu.
20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
WANG.
21) Use after free in netem, from Eric Dumazet.
22) Uninitialized last assert time in multicast router code, from Tom
Goff.
23) Skip raw sockets in sock_diag destruction broadcast, from Willem
de Bruijn.
24) Fix link status reporting in thunderx, from Sunil Goutham.
25) Limit resegmentation of retransmit queue so that we do not
retransmit too large GSO frames. From Eric Dumazet.
26) Delay bpf program release after grace period, from Daniel
Borkmann"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
openvswitch: fix conntrack netlink event delivery
qed: Protect the doorbell BAR with the write barriers.
neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
e1000e: keep VLAN interfaces functional after rxvlan off
cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
bpf, perf: delay release of BPF prog after grace period
net: bridge: fix vlan stats continue counter
tcp: do not send too big packets at retransmit time
ibmvnic: fix to use list_for_each_safe() when delete items
net: thunderx: Fix TL4 configuration for secondary Qsets
net: thunderx: Fix link status reporting
net/mlx5e: Reorganize ethtool statistics
net/mlx5e: Fix number of PFC counters reported to ethtool
net/mlx5e: Prevent adding the same vxlan port
net/mlx5e: Check for BlueFlame capability before allocating SQ uar
net/mlx5e: Change enum to better reflect usage
net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
net/mlx5: Update command strings
net: marvell: Add separate config ANEG function for Marvell 88E1111
...
Commit dead9f29dd ("perf: Fix race in BPF program unregister") moved
destruction of BPF program from free_event_rcu() callback to __free_event(),
which is problematic if used with tail calls: if prog A is attached as
trace event directly, but at the same time present in a tail call map used
by another trace event program elsewhere, then we need to delay destruction
via RCU grace period since it can still be in use by the program doing the
tail call (the prog first needs to be dropped from the tail call map, then
trace event with prog A attached destroyed, so we get immediate destruction).
Fixes: dead9f29dd ("perf: Fix race in BPF program unregister")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jann Horn <jann@thejh.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only users of audit_get_tty and audit_put_tty are internal to
audit, so move it out of include/linux/audit.h to kernel.h and create
a proper function rather than inlining it. This also reduces kABI
changes.
Suggested-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: line wrapped description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Diag intends to broadcast tcp_sk and udp_sk socket destruction.
Testing sk->sk_protocol for IPPROTO_TCP/IPPROTO_UDP alone is not
sufficient for this. Raw sockets can have the same type.
Add a test for sk->sk_type.
Fixes: eb4cb00852 ("sock_diag: define destruction multicast groups")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case of SGMII more, for example when a MAC2MAC connection
is needed, the port selection bits (inside the MAC configuration
registers) have to be programmed according to the link selected.
So the patch adds a new DT parameter to pass the port selection
and to programmed related PCS and CORE to use it.
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calling the fixed-phy functions when CONFIG_FIXED_PHY=m as a previous
change tried cannot work if the caller is in built-in code:
drivers/of/built-in.o: In function `of_phy_register_fixed_link':
of_reserved_mem.c:(.text+0x85e0): undefined reference to `fixed_phy_register'
Making of_mdio depend on 'FIXED_PHY || !FIXED_PHY' would solve this
dependency by enforcing that OF_MDIO itself becomes a loadable module
when FIXED_PHY=y, but that creates a different dependency as it
breaks any built-in ethernet driver that uses of_mdio.
Making FIXED_PHY a bool option also cannot work, since it depends on
PHYLIB, which again is tristate.
This version now uses 'select FIXED_PHY' to ensure that the fixed-phy
portion of of_mdio is not optional. The main downside of this is
a small increase in code size for cases that do not need fixed phy
support, but it should avoid all of the link-time problems.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: d1bd330a22 ("of_mdio: Enable fixed PHY support if driver is a module")
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previous to this patch auto negotiation was reported off although it was
on by default in hardware. This patch reports the correct information to
ethtool and allows the user to toggle it on/off.
Added another parameter to set port proto function in order to pass
the auto negotiation field to the hardware.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a dedicated function to toggle port link. It should be called only
after setting a port register.
Toggle will set port link to down and bring it back up in case that it's
admin status was up.
Signed-off-by: Gal Pressman <galp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Configuring and managing HW rate limit tables.
The HW holds a table of rate limits, each rate is
associated with an index in that table.
Later a Send Queue uses this index to set the rate limit.
Multiple Send Queues can have the same rate limit, which is
represented by a single entry in this table.
Even though a rate can be shared, each queue is being rate
limited independently of others.
The SW shadow of this table holds the rate itself,
the index in the HW table and the refcount (number of queues)
working with this rate.
The exported functions are mlx5_rl_add_rate and mlx5_rl_remove_rate.
Number of different rates and their values are derived
from HW capabilities.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAV2q17PSw1s6N8H32AQKOixAAh+Fw6H7j/biCA73Fi0CtVJpqxvDEoo0d
jGBNtIueiyTnuGpV8yqT23xrgcuaQlLoEPhwQTRPFy/jI2qmah69kuIZzz6ZvNxA
sSsbc4M7PerMwmX/gPwUtvWflal1ECmpc6f+5y3pZBhqowMwm9HwxR0489FCurba
4k1w/OxDQIIH88RIsNcYX129xTIvekzB8bjhkIzfQM3WLfelhYyPr8haTt+CrCuF
gdLB7O+AoCI7rxXuS+7blZq4+AryzNAWjpJQdQXlClF2UJhBDNO6CptmsmL5kpuN
0a6ijumB4Onak6zZhQo5PmvX2UbQbh6QEuGm1ZsdyZoTOFPgynRv0ZVgF4JmBl0t
vhkKrbtcLkSYUHnFUoCBDuJnI9exugUH5a0BjdVPs/J6Zha0eS1pF0IOZEOQzC8j
C4U+dgjKP4OArRLS6rRR5oS99zFijcTR+fNp+0rORwqgiDJhBgkbIO9y2hyDrjr9
OW+Hnbm2EPNF86kLG+zi/OSZ2Af/fX5gtkHvfLETE1rxzZi6G1leBiEOezCVJa1P
/jtr8RR4hDOND8XZ4qvUC0yRBo2ykyW0OjtqZ2No7PuL1z/N6gvaV0PEn62yrP/g
XCcmVxXquTMrQlqu66cQGJ+ZdD6MYa44oAOHGRpX55GCvZyjpMFrLPJ8B9XbXoh3
YXIDu5U8kQs=
=HxJc
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-rewrite-20160622-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Get rid of conn bundle and transport structs
Here's the next part of the AF_RXRPC rewrite. The primary purpose of this
set is to get rid of the rxrpc_conn_bundle and rxrpc_transport structs.
This simplifies things for future development of the connection handling.
To this end, the following significant changes are made:
(1) The rxrpc_connection struct is given pointers to the local and peer
endpoints, inside the rxrpc_conn_parameters struct. Pointers to the
transport's copy of these pointers are then redirected to the
connection struct.
(2) Exclusive connection handling is fixed. Exclusive connections should
do just one call and then be retired. They are used in security
negotiations and, I believe, the idea is to avoid reuse of negotiated
security contexts.
The current code is doing a single connection per socket and doing all
the calls over that. With this change it gets a new connection for
each call made.
(3) A new sendmsg() control message marker is added to make individual
calls operate over exclusive connections. This should be used in
future in preference to the sockopt that marks a socket as "exclusive
connection".
(4) IDs for client connections initiated by a machine are now allocated
from a global pool using the IDR facility and are unique across all
client connections, no matter their destination. The IDR facility is
then used to look up a connection on the connection ID alone. Other
parameters are then verified afterwards.
Note that the IDR facility may use a lot of memory if the IDs it holds
are widely scattered. Given this, in a future commit, client
connections will be retired if they are more than a certain distance
from the last ID allocated.
The client epoch is advanced by 1 each time the client ID counter
wraps. Connections outside the current epoch will also be retired in
a future commit.
(5) The connection bundle concept is removed and the client connection
tree is moved into the local endpoint. The queue for waiting for a
call channel is moved to the rxrpc_connection struct as there can only
be one connection for any particular key going to any particular peer
now.
(6) The rxrpc_transport struct is removed and the service connection tree
is moved into the peer struct.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull locking fix from Thomas Gleixner:
"A single fix to address a race in the static key logic"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/static_key: Fix concurrent static_key_slow_inc()
Merge misc fixes from Andrew Morton:
"Two weeks worth of fixes here"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits)
init/main.c: fix initcall_blacklisted on ia64, ppc64 and parisc64
autofs: don't get stuck in a loop if vfs_write() returns an error
mm/page_owner: avoid null pointer dereference
tools/vm/slabinfo: fix spelling mistake: "Ocurrences" -> "Occurrences"
fs/nilfs2: fix potential underflow in call to crc32_le
oom, suspend: fix oom_reaper vs. oom_killer_disable race
ocfs2: disable BUG assertions in reading blocks
mm, compaction: abort free scanner if split fails
mm: prevent KASAN false positives in kmemleak
mm/hugetlb: clear compound_mapcount when freeing gigantic pages
mm/swap.c: flush lru pvecs on compound page arrival
memcg: css_alloc should return an ERR_PTR value on error
memcg: mem_cgroup_migrate() may be called with irq disabled
hugetlb: fix nr_pmds accounting with shared page tables
Revert "mm: disable fault around on emulated access bit architecture"
Revert "mm: make faultaround produce old ptes"
mailmap: add Boris Brezillon's email
mailmap: add Antoine Tenart's email
mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
mm: mempool: kasan: don't poot mempool objects in quarantine
...
- A couple minor fixes to the rdma core
- Multiple minor fixes to hfi1
- Multiple minor fixes to mlx4/mlx4
- A few minor fixes to i40iw
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXbA1uAAoJELgmozMOVy/dcuEP/j5NyPmyU8XXHDloGU9b8ybu
HdnBGYZhvr2OnhuBOGW/z3dEpbVwsSfP7JTY5Z2M2GTmA0h5R8VMp9G3agCNwKvo
y0bt+GrtOzDVNkBXw+Ttqe9fBYZvMsvLe7zBV0DyBTlLeeE27f+d7ahuvFGayer8
C+i3LLgTlfzP6iImDsUdyEvp1nUc0F5Xb8vcVE4znNjwaUF1nrLRJXhSHSliLeZh
DCOr/ElbTPKm1QL0TX/O2HSxE+5zZq9VfepbKxTTBfHZp5U31ZNZrw+VrXin374h
02QdADLzOXPnUznP5UVJZPNUVdsQ8MAlW/Y9Va5w9G4nOtE6jQq8cWBeuY5g7JAJ
aTmHD60p09xGtrd+9/K7mEoSpHDUWgGl/bqTlGMRZFvBXUSxfdbEjt7TRBH2eASl
e8tOQbfMnlqO5kKDCa5BXKCJaaAeKafzUpa5kbVQZW113dBbaTC6m5qnjEEdwriX
7l4FC4vmAxrzURWd1r7oAwn4gJVDxGmXTUpgI6PzWFx93hRAukIHKsVPZ2U8IKZx
tF8SkcifXXT7F4TqSgn59SGc7BiKfvc7/vO1XBI/8nHO2NsKIVrx1S1X/yOy3Hsd
EJg42g41yr+wjQNAhG8RShgfzWh00BHtVFz+4KUESQRnZZLSMnC/EnEfGV9IzHTf
Jaz+r5+Fh2rYY/hoIn6h
=F0nv
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
"This is the second batch of queued up rdma patches for this rc cycle.
There isn't anything really major in here. It's passed 0day,
linux-next, and local testing across a wide variety of hardware.
There are still a few known issues to be tracked down, but this should
amount to the vast majority of the rdma RC fixes.
Round two of 4.7 rc fixes:
- A couple minor fixes to the rdma core
- Multiple minor fixes to hfi1
- Multiple minor fixes to mlx4/mlx4
- A few minor fixes to i40iw"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (31 commits)
IB/srpt: Reduce QP buffer size
i40iw: Enable level-1 PBL for fast memory registration
i40iw: Return correct max_fast_reg_page_list_len
i40iw: Correct status check on i40iw_get_pble
i40iw: Correct CQ arming
IB/rdmavt: Correct qp_priv_alloc() return value test
IB/hfi1: Don't zero out qp->s_ack_queue in rvt_reset_qp
IB/hfi1: Fix deadlock with txreq allocation slow path
IB/mlx4: Prevent cross page boundary allocation
IB/mlx4: Fix memory leak if QP creation failed
IB/mlx4: Verify port number in flow steering create flow
IB/mlx4: Fix error flow when sending mads under SRIOV
IB/mlx4: Fix the SQ size of an RC QP
IB/mlx5: Fix wrong naming of port_rcv_data counter
IB/mlx5: Fix post send fence logic
IB/uverbs: Initialize ib_qp_init_attr with zeros
IB/core: Fix false search of the IB_SA_WELL_KNOWN_GUID
IB/core: Fix RoCE v1 multicast join logic issue
IB/core: Fix no default GIDs when netdevice reregisters
IB/hfi1: Send a pkey change event on driver pkey update
...
This reverts commit 5c0a85fad9.
The commit causes ~6% regression in unixbench.
Let's revert it for now and consider other solution for reclaim problem
later.
Link: http://lkml.kernel.org/r/1465893750-44080-2-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we may put reserved by mempool elements into quarantine via
kasan_kfree(). This is totally wrong since quarantine may really free
these objects. So when mempool will try to use such element,
use-after-free will happen. Or mempool may decide that it no longer
need that element and double-free it.
So don't put object into quarantine in kasan_kfree(), just poison it.
Rename kasan_kfree() to kasan_poison_kfree() to respect that.
Also, we shouldn't use kasan_slab_alloc()/kasan_krealloc() in
kasan_unpoison_element() because those functions may update allocation
stacktrace. This would be wrong for the most of the remove_element call
sites.
(The only call site where we may want to update alloc stacktrace is
in mempool_alloc(). Kmemleak solves this by calling
kmemleak_update_trace(), so we could make something like that too.
But this is out of scope of this patch).
Fixes: 55834c5909 ("mm: kasan: initial memory quarantine implementation")
Link: http://lkml.kernel.org/r/575977C3.1010905@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The INIT_TASK() initializer was similarly confused about the stack vs
thread_info allocation that the allocators had, and that were fixed in
commit b235beea9e ("Clarify naming of thread info/stack allocators").
The task ->stack pointer only incidentally ends up having the same value
as the thread_info, and in fact that will change.
So fix the initial task struct initializer to point to 'init_stack'
instead of 'init_thread_info', and make sure the ia64 definition for
that exists.
This actually makes the ia64 tsk->stack pointer be sensible for the
initial task, but not for any other task. As mentioned in commit
b235beea9e, that whole pointer isn't actually used on ia64, since
task_stack_page() there just points to the (single) allocation.
All the other architectures seem to have copied the 'init_stack'
definition, even if it tended to be generally unusued.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've had the thread info allocated together with the thread stack for
most architectures for a long time (since the thread_info was split off
from the task struct), but that is about to change.
But the patches that move the thread info to be off-stack (and a part of
the task struct instead) made it clear how confused the allocator and
freeing functions are.
Because the common case was that we share an allocation with the thread
stack and the thread_info, the two pointers were identical. That
identity then meant that we would have things like
ti = alloc_thread_info_node(tsk, node);
...
tsk->stack = ti;
which certainly _worked_ (since stack and thread_info have the same
value), but is rather confusing: why are we assigning a thread_info to
the stack? And if we move the thread_info away, the "confusing" code
just gets to be entirely bogus.
So remove all this confusion, and make it clear that we are doing the
stack allocation by renaming and clarifying the function names to be
about the stack. The fact that the thread_info then shares the
allocation is an implementation detail, and not really about the
allocation itself.
This is a pure renaming and type fix: we pass in the same pointer, it's
just that we clarify what the pointer means.
The ia64 code that actually only has one single allocation (for all of
task_struct, thread_info and kernel thread stack) now looks a bit odd,
but since "tsk->stack" is actually not even used there, that oddity
doesn't matter. It would be a separate thing to clean that up, I
intentionally left the ia64 changes as a pure brute-force renaming and
type change.
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The following scenario is possible:
CPU 1 CPU 2
static_key_slow_inc()
atomic_inc_not_zero()
-> key.enabled == 0, no increment
jump_label_lock()
atomic_inc_return()
-> key.enabled == 1 now
static_key_slow_inc()
atomic_inc_not_zero()
-> key.enabled == 1, inc to 2
return
** static key is wrong!
jump_label_update()
jump_label_unlock()
Testing the static key at the point marked by (**) will follow the
wrong path for jumps that have not been patched yet. This can
actually happen when creating many KVM virtual machines with userspace
LAPIC emulation; just run several copies of the following program:
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
int main(void)
{
for (;;) {
int kvmfd = open("/dev/kvm", O_RDONLY);
int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
close(ioctl(vmfd, KVM_CREATE_VCPU, 1));
close(vmfd);
close(kvmfd);
}
return 0;
}
Every KVM_CREATE_VCPU ioctl will attempt a static_key_slow_inc() call.
The static key's purpose is to skip NULL pointer checks and indeed one
of the processes eventually dereferences NULL.
As explained in the commit that introduced the bug:
706249c222 ("locking/static_keys: Rework update logic")
jump_label_update() needs key.enabled to be true. The solution adopted
here is to temporarily make key.enabled == -1, and use go down the
slow path when key.enabled <= 0.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # v4.3+
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 706249c222 ("locking/static_keys: Rework update logic")
Link: http://lkml.kernel.org/r/1466527937-69798-1-git-send-email-pbonzini@redhat.com
[ Small stylistic edits to the changelog and the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds support for configuring the device tx/rx coalescing
timeout values in the order of micro seconds. It also adds APIs for
upper layer drivers for reading/updating the coalescing values.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for reading and updating priority flow
control (PFC) attributes in the driver via netlink.
Signed-off-by: Rana Shahout <ranas@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The fixed_phy driver doesn't have to be built-in, and it's
important that of_mdio supports it even if it's a module.
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the caller specified IB_SEND_FENCE in the send flags of the work
request and no previous work request stated that the successive one
should be fenced, the work request would be executed without a fence.
This could result in RDMA read or atomic operations failure due to a MR
being invalidated. Fix this by adding the mlx5 enumeration for fencing
RDMA/atomic operations and fix the logic to apply this.
Fixes: e126ba97db ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
This allows a clean shutdown, even if some netdev clients do not
release their reference from this netdev. It is enough to release
the HW resources only as the kernel is shutting down.
Fixes: 2ba5fbd62b ('net/mlx4_core: Handle AER flow properly')
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"Exclusive connections" are meant to be used for a single client call and
then scrapped. The idea is to limit the use of the negotiated security
context. The current code, however, isn't doing this: it is instead
restricting the socket to a single virtual connection and doing all the
calls over that.
This is changed such that the socket no longer maintains a special virtual
connection over which it will do all the calls, but rather gets a new one
each time a new exclusive call is made.
Further, using a socket option for this is a poor choice. It should be
done on sendmsg with a control message marker instead so that calls can be
marked exclusive individually. To that end, add RXRPC_EXCLUSIVE_CALL
which, if passed to sendmsg() as a control message element, will cause the
call to be done on an single-use connection.
The socket option (RXRPC_EXCLUSIVE_CONNECTION) still exists and, if set,
will override any lack of RXRPC_EXCLUSIVE_CALL being specified so that
programs using the setsockopt() will appear to work the same.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull vfs fixes from Al Viro:
"A couple more of d_walk()/d_subdirs reordering fixes (stable fodder;
ought to solve that crap for good) and a fix for a brown paperbag bug
in d_alloc_parallel() (this cycle)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix idiotic braino in d_alloc_parallel()
autofs races
much milder d_walk() race
Several user APIs can cause driver to perform an inner-reload.
Currently, doing this would cause the HW/FW statistics of the
adapter to reset, which isn't the expected behavior [statistics
should only reset on explicit unloads].
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving an ICMPv4 message containing extensions as
defined in RFC 4884, and translating it to ICMPv6 at SIT
or GRE tunnel, we need some extra manipulation in order
to properly forward the extensions.
This patch only takes care of Time Exceeded messages as they
are the ones that typically carry information from various
routers in a fabric during a traceroute session.
It also avoids complex skb logic if the data_len is not
a multiple of 8.
RFC states :
The "original datagram" field MUST contain at least 128 octets.
If the original datagram did not contain 128 octets, the
"original datagram" field MUST be zero padded to 128 octets.
In practice routers use 128 bytes of original datagram, not more.
Initial translation was added in commit ca15a078bd
("sit: generate icmpv6 error when receiving icmpv4 error")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For better traceroute/mtr support for SIT and GRE tunnels,
we translate IPV4 ICMP ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEED
We also have to translate the IPv4 source IP address of ICMP
message to IPv6 v4mapped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to use this helper from GRE as well, so this is
the time to move it in net/ipv6/icmp.c
Also add a @nhs parameter, since SIT and GRE have different
values for the header(s) to skip.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIT or GRE tunnels might want to translate an IPV4 address
into a v4mapped one when translating ICMP to ICMPv6.
This patch adds the parameter to icmp6_send() but
does not change icmpv6_send() signature.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Here are a bunch (65) of USB fixes for 4.7-rc4. Sorry about the
quantity, I've been slow in getting these out. The majority are the
"normal" gadget, musb, and xhci fixes, that we all are used to. There
are also a few other tiny fixes resolving a number of reported issues
that showed up in 4.7-rc1.
All of these have been in linux-next.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAldlbRgACgkQMUfUDdst+ynNQACeMwami14392grxjLQvhCtR29u
ekkAoMK+vLu79mbLPhoBmPo4YCUvSptw
=1X1d
-----END PGP SIGNATURE-----
Merge tag 'usb-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a bunch (65) of USB fixes for 4.7-rc4. Sorry about the
quantity, I've been slow in getting these out.
The majority are the "normal" gadget, musb, and xhci fixes, that we
all are used to. There are also a few other tiny fixes resolving a
number of reported issues that showed up in 4.7-rc1.
All of these have been in linux-next"
* tag 'usb-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (65 commits)
usbip: rate limit get_frame_number message
usb: musb: sunxi: Remove bogus "Frees glue" comment
usb: musb: sunxi: Fix NULL ptr deref when gadget is registered before musb
usb: echi-hcd: Add ehci_setup check before echi_shutdown
usb: host: ehci-msm: Conditionally call ehci suspend/resume
MAINTAINERS: Add file patterns for usb device tree bindings
usb: host: ehci-tegra: Avoid getting the same reset twice
usb: host: ehci-tegra: Grab the correct UTMI pads reset
USB: mos7720: delete parport
USB: OHCI: Don't mark EDs as ED_OPER if scheduling fails
phy: ti-pipe3: Program the DPLL even if it was already locked
usb: musb: Stop bulk endpoint while queue is rotated
usb: musb: Ensure rx reinit occurs for shared_fifo endpoints
usb: musb: host: correct cppi dma channel for isoch transfer
usb: musb: only restore devctl when session was set in backup
usb: phy: Check initial state for twl6030
usb: musb: Use normal module_init for 2430 glue
usb: musb: Remove pm_runtime_set_irq_safe
usb: musb: Remove extra PM runtime calls from 2430 glue layer
usb: musb: Return error value from musb_mailbox
...
Here are a number of IIO and Staging bugfixes for 4.7-rc4.
Nothing huge, the normal amount of iio driver fixes, and some small
staging driver bugfixes for some reported problems (2 are reverts of
patches that went into 4.7-rc1). All have been in linux-next with no
reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAldlbB4ACgkQMUfUDdst+ykpkACggRrmGzuiDe1hQHbq1J+jaokX
V2kAoK0WiQ66uv4nd1FbLx2CcUb0xgI5
=c2n1
-----END PGP SIGNATURE-----
Merge tag 'staging-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull IIO and staging fixes from Greg KH:
"Here are a number of IIO and staging bugfixes for 4.7-rc4.
Nothing huge, the normal amount of iio driver fixes, and some small
staging driver bugfixes for some reported problems (2 are reverts of
patches that went into 4.7-rc1). All have been in linux-next with no
reported issues"
* tag 'staging-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (24 commits)
Revert "Staging: rtl8188eu: rtw_efuse: Use sizeof type *pointer instead of sizeof type."
Revert "Staging: drivers: rtl8188eu: use sizeof(*ptr) instead of sizeof(struct)"
staging: lustre: lnet: Don't access NULL NI on failure path
iio: hudmidity: hdc100x: fix incorrect shifting and scaling
iio: light apds9960: Add the missing dev.parent
iio: Fix error handling in iio_trigger_attach_poll_func
iio: st_sensors: Disable DRDY at init time
iio: st_sensors: Init trigger before irq request
iio: st_sensors: switch to a threaded interrupt
iio: light: bh1780: assign a static name
iio: bh1780: dereference the client properly
iio: humidity: hdc100x: fix IIO_TEMP channel reporting
iio:st_pressure: fix sampling gains (bring inline with ABI)
iio: proximity: as3935: fix buffer stack trashing
iio: proximity: as3935: remove triggered buffer processing
iio: proximity: as3935: correct IIO_CHAN_INFO_RAW output
max44000: Remove scale from proximity
iio: humidity: am2315: Remove a stray unlock
iio: humidity: hdc100x: correct humidity integration time mask
iio: pressure: bmp280: fix error message for wrong chip id
...
Here are a small number of debugfs, ISA, and one driver core fix for 4.7-rc4.
All of these resolve reported issues. The ISA ones have spent the least
amount of time in linux-next, sorry about that, I didn't realize they
were regressions that needed to get in now (thanks to Thorsten for the
prodding!) but they do all pass the 0-day bot tests. The others have
been in linux-next for a while now.
Full details about them are in the shortlog below.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAldlbeQACgkQMUfUDdst+ymnFACfaWhEKA/84jwNNHiim92diJrY
zYsAoLOmpBw68yL6qTSZbcWJF4Flb6Xk
=N8M2
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are a small number of debugfs, ISA, and one driver core fix for
4.7-rc4.
All of these resolve reported issues. The ISA ones have spent the
least amount of time in linux-next, sorry about that, I didn't realize
they were regressions that needed to get in now (thanks to Thorsten
for the prodding!) but they do all pass the 0-day bot tests. The
others have been in linux-next for a while now.
Full details about them are in the shortlog below"
* tag 'driver-core-4.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
isa: Dummy isa_register_driver should return error code
isa: Call isa_bus_init before dependent ISA bus drivers register
watchdog: ebc-c384_wdt: Allow build for X86_64
iio: stx104: Allow build for X86_64
gpio: Allow PC/104 devices on X86_64
isa: Allow ISA-style drivers on modern systems
base: make module_create_drivers_dir race-free
debugfs: open_proxy_open(): avoid double fops release
debugfs: full_proxy_open(): free proxy on ->open() failure
kernel/kcov: unproxify debugfs file's fops
The inline isa_register_driver stub simply allows compilation on systems
with CONFIG_ISA disabled; the dummy isa_register_driver does not
register an isa_driver at all. The inline isa_register_driver should
return -ENODEV to indicate lack of support when attempting to register
an isa_driver on such a system with CONFIG_ISA disabled.
Cc: Matthew Wilcox <matthew@wil.cx>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Ye Xiaolong
Signed-off-by: William Breathitt Gray <vilhelm.gray@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Now that we have all the drivers using udp_tunnel_get_rx_ports,
ndo_add_udp_enc_rx_port, and ndo_del_udp_enc_rx_port we can drop the
function calls that were specific to VXLAN and GENEVE.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch merges the notifiers for VXLAN and GENEVE into a single UDP
tunnel notifier. The idea is that we will want to only have to make one
notifier call to receive the list of ports for VXLAN and GENEVE tunnels
that need to be offloaded.
In addition we add a new set of ndo functions named ndo_udp_tunnel_add and
ndo_udp_tunnel_del that are meant to allow us to track the tunnel meta-data
such as port and address family as tunnels are added and removed. The
tunnel meta-data is now transported in a structure named udp_tunnel_info
which for now carries the type, address family, and port number. In the
future this could be updated so that we can include a tuple of values
including things such as the destination IP address and other fields.
I also ended up going with a naming scheme that consisted of using the
prefix udp_tunnel on function names. I applied this to the notifier and
ndo ops as well so that it hopefully points to the fact that these are
primarily used in the udp_tunnel functions.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Several modern devices, such as PC/104 cards, are expected to run on
modern systems via an ISA bus interface. Since ISA is a legacy interface
for most modern architectures, ISA support should remain disabled in
general. Support for ISA-style drivers should be enabled on a per driver
basis.
To allow ISA-style drivers on modern systems, this patch introduces the
ISA_BUS_API and ISA_BUS Kconfig options. The ISA bus driver will now
build conditionally on the ISA_BUS_API Kconfig option, which defaults to
the legacy ISA Kconfig option. The ISA_BUS Kconfig option allows the
ISA_BUS_API Kconfig option to be selected on architectures which do not
enable ISA (e.g. X86_64).
The ISA_BUS Kconfig option is currently only implemented for X86
architectures. Other architectures may have their own ISA_BUS Kconfig
options added as required.
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: William Breathitt Gray <vilhelm.gray@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>