Introduce NETIF_F_GRO_HW feature flag for NICs that support hardware
GRO. With this flag, we can now independently turn on or off hardware
GRO when GRO is on. Previously, drivers were using NETIF_F_GRO to
control hardware GRO and so it could not be independently turned on or
off without affecting GRO.
Hardware GRO (just like GRO) guarantees that packets can be re-segmented
by TSO/GSO to reconstruct the original packet stream. Logically,
GRO_HW should depend on GRO since it is a subset, but we will let
individual drivers enforce this dependency as they see fit.
Since NETIF_F_GRO is not propagated between upper and lower devices,
NETIF_F_GRO_HW should follow suit since it is a subset of GRO. In other
words, a lower device can independently have GRO/GRO_HW enabled or disabled
and no feature propagation is required. This will preserve the current
GRO behavior. This can be changed later if we decide to propagate GRO/
GRO_HW/RXCSUM from upper to lower devices.
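A driver can enforce the GRO_HW-depends-on-GRO relationship in its
ndo_fix_features callback, roughly like this (an illustrative sketch,
not any specific driver's code; "foo" is a placeholder name):

	static netdev_features_t foo_fix_features(struct net_device *dev,
						  netdev_features_t features)
	{
		/* Hardware GRO is a subset of GRO: drop GRO_HW whenever
		 * GRO itself is disabled.
		 */
		if (!(features & NETIF_F_GRO))
			features &= ~NETIF_F_GRO_HW;
		return features;
	}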
Cc: Ariel Elior <Ariel.Elior@cavium.com>
Cc: everest-linux-l2@cavium.com
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2017-12-18
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Allow arbitrary function calls from one BPF function to another BPF function.
As of today when writing BPF programs, __always_inline had to be used in
the BPF C programs for all functions, unnecessarily causing LLVM to inflate
code size. Handle this more naturally with support for BPF to BPF calls
such that this __always_inline restriction can be overcome. As a result,
it allows for better optimized code and finally makes it possible to
introduce core BPF libraries in the future that can be reused across
different projects.
x86 and arm64 JIT support was added as well, from Alexei.
2) Add infrastructure for tagging functions as error injectable and allow for
BPF to return arbitrary error values when BPF is attached via kprobes on
those. This way of injecting errors generically eases testing and debugging
without having to recompile or restart the kernel. Tags for opting-in for
this facility are added with BPF_ALLOW_ERROR_INJECTION(), from Josef.
3) For BPF offload via nfp JIT, add support for bpf_xdp_adjust_head() helper
call for XDP programs. First part of this work adds handling of BPF
capabilities included in the firmware, and the later patches add support
to the nfp verifier part and JIT as well as some small optimizations,
from Jakub.
4) The bpftool now also gets support for basic cgroup BPF operations such
as attaching, detaching and listing current BPF programs. As a requirement
for the attach part, bpftool can now also load object files through
'bpftool prog load'. This reuses libbpf which we have in the kernel tree
as well. bpftool-cgroup man page is added along with it, from Roman.
5) Back then commit e87c6bc385 ("bpf: permit multiple bpf attachments for
a single perf event") added support for attaching multiple BPF programs
to a single perf event. Given they are configured through perf's ioctl()
interface, the interface has been extended with a PERF_EVENT_IOC_QUERY_BPF
command in this work in order to return an array of one or multiple BPF
prog ids that are currently attached, from Yonghong.
6) Various minor fixes and cleanups to the bpftool's Makefile as well
as a new 'uninstall' and 'doc-uninstall' target for removing bpftool
itself or prior installed documentation related to it, from Quentin.
7) Add CONFIG_CGROUP_BPF=y to the BPF kernel selftest config file which is
required for the test_dev_cgroup test case to run, from Naresh.
8) Fix reporting of XDP prog_flags for nfp driver, from Jakub.
9) Fix libbpf's exit code from the Makefile when libelf was not found in
the system, also from Jakub.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A typical JIT does several passes over bpf instructions to
compute total size and relative offsets of jumps and calls.
With multiple bpf functions calling each other, all relative calls
will have invalid offsets initially, therefore we need an additional
last pass over the program to emit calls with correct offsets.
For example in case of three bpf functions:
main:
	call foo
	call bpf_map_lookup
	exit
foo:
	call bar
	exit
bar:
	exit
We will call bpf_int_jit_compile() independently for main(), foo() and bar().
The x64 JIT typically does 4-5 passes to converge.
After these initial passes the image for these 3 functions
will be good except for call targets, since the start addresses of
foo() and bar() were unknown while we were JITing main()
(note that call bpf_map_lookup will be resolved properly
during initial passes).
Once the start addresses of all 3 functions are known, we patch
call_insn->imm to point to the right functions and call
bpf_int_jit_compile() again, which needs only one pass.
Additional safety checks are done to make sure this
last pass doesn't produce an image that is larger or smaller
than the one from the previous pass.
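In rough pseudo-C, the final fixup described above looks like this
(a sketch only; subprog_addr() is a hypothetical helper that returns
the now-known start address of the callee's JITed image):

	for (i = 0; i < prog->len; i++) {
		struct bpf_insn *insn = &prog->insnsi[i];

		/* only bpf-to-bpf pseudo calls need patching */
		if (insn->code != (BPF_JMP | BPF_CALL) ||
		    insn->src_reg != BPF_PSEUDO_CALL)
			continue;
		insn->imm = subprog_addr(insn) - __bpf_call_base;
	}
	prog = bpf_int_jit_compile(prog);	/* the single extra pass */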
When constant blinding is on it's applied to all functions
at the first pass, since doing it once again at the last
pass can change the size of the JITed code.
Tested on x64 and arm64 hw with JIT on/off, blinding on/off.
x64 jits bpf-to-bpf calls correctly while arm64 falls back to the interpreter.
All other JITs that support normal BPF_CALL will behave the same way,
since a bpf-to-bpf call is equivalent to a bpf-to-kernel call from
the JIT's point of view.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
The global bpf_jit_enable variable is tested multiple times in JITs,
blinding and verifier core. A malicious root user can try to toggle
it while programs are being loaded. This race condition was accounted
for and there should be no issues, but it's safer to avoid
it altogether.
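One way to avoid it, shown here as an illustrative sketch rather than
the exact shape of this patch, is to read the variable once and use
the snapshot for all subsequent decisions:

	/* a concurrent sysctl toggle no longer changes the outcome */
	int jit_enabled = READ_ONCE(bpf_jit_enable);

	if (jit_enabled)
		prog = bpf_int_jit_compile(prog);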
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Though bpf_call is still the same call instruction and the
calling convention for 'bpf to bpf' and 'bpf to helper' calls is the
same, the interpreter has to operate on 'struct bpf_insn *'.
To distinguish these two cases add a kernel internal opcode and
mark call insns with it.
This opcode is seen by the interpreter only. JITs will never see it.
Also add tiny bit of debug code to aid interpreter debugging.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Programs with function calls often pass various
pointers via the stack. When all calls are inlined, llvm
flattens stack accesses and optimizes away extra branches.
When functions are not inlined, it becomes the job of
the verifier to recognize zero-initialized stack to avoid
exploring paths that the program will not take.
The following program would fail otherwise:
	ptr = &buffer_on_stack;
	*ptr = 0;
	...
	func_call(.., ptr, ...) {
		if (..)
			*ptr = bpf_map_lookup();
	}
	...
	if (*ptr != 0) {
		// Access (*ptr)->field is valid.
		// Without stack_zero tracking such (*ptr)->field access
		// will be rejected
	}
Since stack slots are no longer uniform (invalid | spill | misc),
add liveness marking to all slots, but do it in 8-byte chunks.
So if nothing was read or written in the [fp-16, fp-9] range
it will be marked as LIVE_NONE.
If any byte in that range was read, it will be marked LIVE_READ
and stacksafe() check will perform byte-by-byte verification.
If all bytes in the range were written the slot will be
marked as LIVE_WRITTEN.
This significantly speeds up state equality comparison
and reduces total number of states processed.
                      before   after
bpf_lb-DLB_L3.o         2051    2003
bpf_lb-DLB_L4.o         3287    3164
bpf_lb-DUNKNOWN.o       1080    1080
bpf_lxc-DDROP_ALL.o    24980   12361
bpf_lxc-DUNKNOWN.o     34308   16605
bpf_netdev.o           15404   10962
bpf_overlay.o           7191    6679
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow arbitrary function calls from one bpf function to another bpf function.
To recognize such a set of bpf functions, the verifier does:
1. runs control flow analysis to detect function boundaries
2. proceeds with verification of all functions starting from the main (root) function
It recognizes that the stack of the caller can be accessed by the callee
(if the caller passed a pointer to its stack to the callee) and the callee
can store map_value and other pointers into the stack of the caller.
3. keeps track of the stack_depth of each function to make sure that total
stack depth is still less than 512 bytes
4. disallows pointers to the callee stack to be stored into the caller stack,
since they will be invalid as soon as the callee returns
5. to reuse all of the existing state_pruning logic each function call
is considered to be an independent call from the verifier's point of view.
The verifier effectively pretends to inline every function call it sees.
It stores the callsite instruction index as part of the state to make sure
that two calls to the same callee from two different places in the caller
will be different from state pruning point of view
6. more safety checks are added to liveness analysis
Implementation details:
. struct bpf_verifier_state now consists of all stack frames that
led to this function
. struct bpf_func_state represents one stack frame. It consists of
registers in the given frame and its stack
. propagate_liveness() logic had a premature optimization where
mark_reg_read() and mark_stack_slot_read() were manually inlined
with a loop iterating over parents for each register or stack slot.
Undo this optimization to reuse the more complex mark_*_read() logic.
. skip_callee() logic is not necessary from a safety point of view,
but without it mark_*_read() markings become too conservative,
since after returning from the function call a read of r6-r9
will incorrectly propagate the read marks into callee causing
inefficient pruning later
. mark_*_read() logic is now aware of control flow which makes it
more complex. In the future the plan is to rewrite liveness
to be hierarchical. So that liveness can be done within
basic block only and control flow will be responsible for
propagation of liveness information along cfg and between calls.
. tail_calls and ld_abs insns are not allowed in the programs with
bpf-to-bpf calls
. returning stack pointers to the caller or storing them into stack
frame of the caller is not allowed
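As a sketch, the new state layout might look roughly like this (field
names follow the description above; the exact definitions may differ):

	struct bpf_func_state {
		/* registers and stack of one frame */
		struct bpf_reg_state regs[MAX_BPF_REG];
		/* insn index of the call that created this frame */
		int callsite;
		u32 frameno;
		struct bpf_stack_state *stack;
	};

	struct bpf_verifier_state {
		/* all stack frames that led to this function */
		struct bpf_func_state *frame[MAX_CALL_FRAMES];
		u32 curframe;
	};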
Testing:
. no difference in cilium processed_insn numbers
. large number of tests follows in next patches
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Allow arbitrary function calls from one bpf function to another bpf function.
Since the beginning of bpf all bpf programs were represented as a single function
and program authors were forced to use always_inline for all functions
in their C code. That was causing llvm to unnecessarily inflate the code size
and forcing developers to move code to header files with little code reuse.
With a bit of additional complexity teach the verifier to recognize
arbitrary function calls from one bpf function to another as long as
all of the functions are presented to the verifier as a single bpf program.
New program layout:
	r6 = r1    // some code
	..
	r1 = ..    // arg1
	r2 = ..    // arg2
	call pc+1  // function call pc-relative
	exit
	.. = r1    // access arg1
	.. = r2    // access arg2
	..
	call pc+20 // second level of function call
	...
It allows for better optimized code and finally makes it possible to
introduce the core bpf libraries that can be reused in different projects,
since programs are no longer limited to a single elf file.
With function calls bpf can be compiled into multiple .o files.
This patch is the first step. It detects programs that contain
multiple functions and checks that calls between them are valid.
It splits the sequence of bpf instructions (one program) into a set
of bpf functions that call each other. Only calls to known
functions are allowed. In the future the verifier may allow
calls to unresolved functions and will do dynamic linking.
This logic supports statically linked bpf functions only.
Such function boundary detection could have been done as part of
control flow graph building in check_cfg(), but it's cleaner to
separate function boundary detection vs control flow checks within
a subprogram (function) into logically independent steps.
Follow up patches may split check_cfg() further, but not check_subprogs().
Only allow bpf-to-bpf calls for root and for non-hw-offloaded programs.
These restrictions can be relaxed in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Three sets of overlapping changes, two in the packet scheduler
and one in the meson-gxl PHY driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Clamp timeouts to INT_MAX in conntrack, from Jay Elliot.
2) Fix broken UAPI for BPF_PROG_TYPE_PERF_EVENT, from Hendrik
Brueckner.
3) Fix locking in ieee80211_sta_tear_down_BA_sessions, from Johannes
Berg.
4) Add missing barriers to ptr_ring, from Michael S. Tsirkin.
5) Don't advertise gigabit in sh_eth when not available, from Thomas
Petazzoni.
6) Check network namespace when delivering to netlink taps, from Kevin
Cernekee.
7) Kill a race in raw_sendmsg(), from Mohamed Ghannam.
8) Use correct address in TCP md5 lookups when replying to an incoming
segment, from Christoph Paasch.
9) Add schedule points to BPF map alloc/free, from Eric Dumazet.
10) Don't allow silly mtu values to be used in ipv4/ipv6 multicast, also
from Eric Dumazet.
11) Fix SKB leak in tipc, from Jon Maloy.
12) Disable MAC learning on OVS ports of mlxsw, from Yuval Mintz.
13) SKB leak fix in skb_complete_tx_timestamp(), from Willem de Bruijn.
14) Add some new qmi_wwan device IDs, from Daniele Palmas.
15) Fix static key imbalance in ingress qdisc, from Jiri Pirko.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (76 commits)
net: qcom/emac: Reduce timeout for mdio read/write
net: sched: fix static key imbalance in case of ingress/clsact_init error
net: sched: fix clsact init error path
ip_gre: fix wrong return value of erspan_rcv
net: usb: qmi_wwan: add Telit ME910 PID 0x1101 support
pkt_sched: Remove TC_RED_OFFLOADED from uapi
net: sched: Move to new offload indication in RED
net: sched: Add TCA_HW_OFFLOAD
net: aquantia: Increment driver version
net: aquantia: Fix typo in ethtool statistics names
net: aquantia: Update hw counters on hw init
net: aquantia: Improve link state and statistics check interval callback
net: aquantia: Fill in multicast counter in ndev stats from hardware
net: aquantia: Fill ndev stat couters from hardware
net: aquantia: Extend stat counters to 64bit values
net: aquantia: Fix hardware DMA stream overload on large MRRS
net: aquantia: Fix actual speed capabilities reporting
sock: free skb in skb_complete_tx_timestamp on error
s390/qeth: update takeover IPs after configuration change
s390/qeth: lock IP table while applying takeover changes
...
Add an entry for the builtin PHYs present in the Broadcom BCM5395 switch. This
allows us to retrieve the PHY statistics among other things.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Chris Healy <cphealy@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull locking fixes from Ingo Molnar:
"Misc fixes:
- Fix a S390 boot hang that was caused by the lock-break logic.
Remove lock-break to begin with, as review suggested it was
unreasonably fragile and our confidence in its continued good
health is lower than our confidence in its removal.
- Remove the lockdep cross-release checking code for now, because of
unresolved false positive warnings. This should make lockdep work
well everywhere again.
- Get rid of the final (and single) ACCESS_ONCE() straggler and
remove the API from v4.15.
- Fix a liblockdep build warning"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools/lib/lockdep: Add missing declaration of 'pr_cont()'
checkpatch: Remove ACCESS_ONCE() warning
compiler.h: Remove ACCESS_ONCE()
tools/include: Remove ACCESS_ONCE()
tools/perf: Convert ACCESS_ONCE() to READ_ONCE()
locking/lockdep: Remove the cross-release locking checks
locking/core: Remove break_lock field when CONFIG_GENERIC_LOCKBREAK=y
locking/core: Fix deadlock during boot on systems with GENERIC_LOCKBREAK
sctp_ifwdtsn_skip, sctp_ifwdtsn_hdr and sctp_ifwdtsn_chunk are used to
define and parse the I-FWD TSN chunk format, and sctp_make_ifwdtsn is a
function to build the chunk.
The I-FORWARD-TSN Chunk Format is defined in section 2.3.1 of RFC8260.
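Sketched from the RFC 8260 wire format, the new structures look
roughly like this (see include/linux/sctp.h for the authoritative
definitions):

	struct sctp_ifwdtsn_skip {
		__be16 stream;
		__u8 reserved;
		__u8 flags;
		__be32 mid;
	};

	struct sctp_ifwdtsn_hdr {
		__be32 new_cum_tsn;
		struct sctp_ifwdtsn_skip skip[0];
	};

	struct sctp_ifwdtsn_chunk {
		struct sctp_chunkhdr chunk_hdr;
		struct sctp_ifwdtsn_hdr fwdtsn_hdr;
	};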
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo R. Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for SFF modules, which are soldered down SFP modules.
These have a different phys_id value, and also have the present and
rate select signals omitted compared with their socketed counterparts.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes an issue in two recent commits that may cause
pm_runtime_enable() to be called for too many times for some
devices during the "thaw" transition belonging to hibernation.
Merge tag 'pm-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"This fixes an issue in two recent commits that may cause
pm_runtime_enable() to be called for too many times for some devices
during the "thaw" transition belonging to hibernation"
* tag 'pm-4.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM / sleep: Avoid excess pm_runtime_enable() calls in device_resume()
- Comment fixes
- Build fix
- Better memory allocation (don't use NR_CPUS)
- Configuration fix
- Build warning fix
- Enhanced callback parameter (to simplify users of trace hooks)
- Give up on stack tracing when RCU isn't watching (it's a lost cause)
Merge tag 'trace-v4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various fix-ups:
- comment fixes
- build fix
- better memory allocation (don't use NR_CPUS)
- configuration fix
- build warning fix
- enhanced callback parameter (to simplify users of trace hooks)
- give up on stack tracing when RCU isn't watching (it's a lost
cause)"
* tag 'trace-v4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have stack trace not record if RCU is not watching
tracing: Pass export pointer as argument to ->write()
ring-buffer: Remove unused function __rb_data_page_index()
tracing: make PREEMPTIRQ_EVENTS depend on TRACING
tracing: Allocate mask_str buffer dynamically
tracing: always define trace_{irq,preempt}_{enable_disable}
tracing: Fix code comments in trace.c
Merge tag 'pci-v4.15-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- add a pci_get_domain_bus_and_slot() stub for the CONFIG_PCI=n case to
avoid build breakage in the v4.16 merge window if a
pci_get_bus_and_slot() -> pci_get_domain_bus_and_slot() patch gets
merged before the PCI tree (Randy Dunlap)
- fix an AMD boot regression in the 64bit BAR support added in v4.15
(Christian König)
- fix an R-Car use-after-free that causes a crash if no PCIe card is
present (Geert Uytterhoeven)
* tag 'pci-v4.15-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: rcar: Fix use-after-free in probe error path
x86/PCI: Only enable a 64bit BAR on single-socket AMD Family 15h
x86/PCI: Fix infinite loop in search for 64bit BAR placement
PCI: Add pci_get_domain_bus_and_slot() stub
David Rientjes has reported the following memory corruption while the
oom reaper tries to unmap the victim's address space:
BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000
addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d
file: (null) fault: (null) mmap: (null) readpage: (null)
CPU: 2 PID: 1001 Comm: oom_reaper
Call Trace:
unmap_page_range+0x1068/0x1130
__oom_reap_task_mm+0xd5/0x16b
oom_reaper+0xff/0x14c
kthread+0xc1/0xe0
Tetsuo Handa has noticed that the synchronization inside exit_mmap is
insufficient. We only synchronize with the oom reaper if
tsk_is_oom_victim is true, which is not the case if the final __mmput is
called from a different context than the oom victim exit path. This can
trivially happen from the context of any task which has grabbed an mm
reference (e.g. to read a /proc/<pid>/ file which requires mm, etc.).
The race would look like this:

oom_reaper              oom_victim              task
                                                mmget_not_zero
                        do_exit
                          mmput
__oom_reap_task_mm                              mmput
                                                  __mmput
                                                    exit_mmap
                                                      remove_vma
  unmap_page_range
Fix this issue by providing a new mm_is_oom_victim() helper which
operates on the mm struct rather than a task. Any context which
operates on a remote mm struct should use this helper in place of
tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared
so it is stable in the exit_mmap path.
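A minimal sketch of the new helper, assuming a per-mm flag set in
mark_oom_victim (the flag name here is illustrative):

	static inline bool mm_is_oom_victim(struct mm_struct *mm)
	{
		return test_bit(MMF_OOM_VICTIM, &mm->flags);
	}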
Debugged by Tetsuo Handa.
Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
Fixes: 2129258024 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: <stable@vger.kernel.org> [4.14]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call to
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.
This patch:
- Make groups_sort globally visible.
- Move the call to groups_sort to the modifiers of group_info.
- Remove the call to groups_sort from set_groups.
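After this change the convention is sort-before-publish; a modifier of
group_info would look roughly like this (illustrative sketch):

	/* sort before the entry becomes visible to other threads */
	groups_sort(group_info);
	set_groups(new, group_info);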
Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc-8 warns about using strncpy() with the source size as the limit:
fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]
This is indeed slightly suspicious, as it protects us from source
arguments without NUL-termination, but does not guarantee that the
destination is terminated.
This keeps the strncpy() to ensure we have a properly padded target
buffer, but ensures that we use the correct length by passing the
actual length of the destination buffer as well as adding a build-time
check to ensure it is exactly TASK_COMM_LEN.
There are only 23 callsites, all of which I reviewed to ensure this is
currently the case. We could get away with doing only the check or
passing the right length, but it doesn't hurt to do both.
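The shape of the fix is roughly the following (a sketch;
__get_task_comm is assumed to be the out-of-line helper that performs
the bounded copy):

	#define get_task_comm(buf, tsk) ({			\
		BUILD_BUG_ON(sizeof(buf) != TASK_COMM_LEN);	\
		__get_task_comm(buf, sizeof(buf), tsk);		\
	})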
Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The hardened strlen() function causes rather large stack usage in at
least one file in the kernel, in particular when CONFIG_KASAN is
enabled:
drivers/media/usb/em28xx/em28xx-dvb.c: In function 'em28xx_dvb_init':
drivers/media/usb/em28xx/em28xx-dvb.c:2062:1: error: the frame size of 3256 bytes is larger than 204 bytes [-Werror=frame-larger-than=]
Analyzing this problem led to the discovery that gcc fails to merge the
stack slots for the i2c_board_info[] structures after we strlcpy() into
them, due to the 'noreturn' attribute on the source string length check.
I reported this as a gcc bug, but it is unlikely to get fixed for gcc-8,
since it is relatively easy to work around, and it gets triggered
rarely. An earlier workaround I did added an empty inline assembly
statement before the call to fortify_panic(), which works surprisingly
well, but is really ugly and unintuitive.
This is a new approach to the same problem, this time addressing it by
not calling the 'extern __real_strnlen()' function for string constants
where __builtin_strlen() is a compile-time constant and therefore known
to be safe.
We do this by checking if the last character in the string is a
compile-time constant '\0'. If it is, we can assume that strlen() of
the string is also constant.
As a side-effect, this should also improve the object code output for
any other call of strlen() on a string constant.
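In outline, the fortified strlen() then short-circuits along these
lines (a simplified sketch of the check described above):

	__FORTIFY_INLINE __kernel_size_t strlen(const char *p)
	{
		__kernel_size_t ret;
		size_t p_size = __builtin_object_size(p, 0);

		/* last byte is a compile-time constant '\0': the
		 * builtin is safe and folds to a constant
		 */
		if (p_size == (size_t)-1 ||
		    (__builtin_constant_p(p[p_size - 1]) &&
		     p[p_size - 1] == '\0'))
			return __builtin_strlen(p);
		ret = strnlen(p, p_size);
		if (p_size <= ret)
			fortify_panic(__func__);
		return ret;
	}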
[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/20171205215143.3085755-1-arnd@arndb.de
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82365
Link: https://patchwork.kernel.org/patch/9980413/
Link: https://patchwork.kernel.org/patch/9974047/
Fixes: 6974f0c455 ("include/linux/string.h: add the option of fortified string.h functions")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a variant of rbtree_replace_node() that maintains the leftmost cache
of struct rbtree_root_cached when replacing nodes within the rbtree.
As drm_mm is the only rb_replace_node() being used on an interval tree,
the mistake looks fairly self-contained. Furthermore the only user of
drm_mm_replace_node() is its testsuite...
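Usage mirrors rb_replace_node(), with the cached root passed in
(illustrative):

	/* replaces @old with @new, keeping rb_leftmost up to date */
	rb_replace_node_cached(&old->node, &new->node, &cached_root);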
Testcase: igt/drm_mm/replace
Link: http://lkml.kernel.org/r/20171122100729.3742-1-chris@chris-wilson.co.uk
Link: https://patchwork.freedesktop.org/patch/msgid/20171109212435.9265-1-chris@chris-wilson.co.uk
Fixes: f808c13fd3 ("lib/interval_tree: fast overlap detection")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The <linux/bug.h> include was removed from radix-tree.h by commit f5bba9d11a
("include/linux/radix-tree.h: remove unneeded #include <linux/bug.h>").
Since that commit, tools/testing/radix-tree/ couldn't pass compilation
due to tools/testing/radix-tree/idr.c:17: undefined reference to
WARN_ON_ONCE. This patch adds the bug.h header to idr.h to solve the
issue.
Link: http://lkml.kernel.org/r/1511963726-34070-2-git-send-email-wei.w.wang@intel.com
Fixes: f5bba9d11a ("include/linux/radix-tree.h: remove unneeded #include <linux/bug.h>")
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit f371b304f1 ("bpf/tracing: allow user space to
query prog array on the same tp") introduced a perf
ioctl command to query prog array attached to the
same perf tracepoint. The commit introduced a
compilation error under certain config conditions, e.g.,
(1). CONFIG_BPF_SYSCALL is not defined, or
(2). CONFIG_TRACING is defined but neither CONFIG_UPROBE_EVENTS
nor CONFIG_KPROBE_EVENTS is defined.
Error message:
kernel/events/core.o: In function `perf_ioctl':
core.c:(.text+0x98c4): undefined reference to `bpf_event_query_prog_array'
This patch fixes the error by guarding the real definition under
CONFIG_BPF_EVENTS and providing a static inline dummy function
when CONFIG_BPF_EVENTS is not defined.
It also renames the function from bpf_event_query_prog_array to
perf_event_query_prog_array and moves the definition from linux/bpf.h
to linux/trace_events.h so that the definition is in proximity to
other prog_array related functions.
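The guard has the usual shape (a sketch of the fix described above):

	#ifdef CONFIG_BPF_EVENTS
	int perf_event_query_prog_array(struct perf_event *event,
					void __user *info);
	#else
	static inline int
	perf_event_query_prog_array(struct perf_event *event,
				    void __user *info)
	{
		return -EOPNOTSUPP;
	}
	#endif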
Fixes: f371b304f1 ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
phylink_get_fixed_state() currently consults an optional "link_gpio"
GPIO descriptor, expand this mechanism to allow specifying a custom
callback. This is necessary to support out of band link notification
(e.g: from an interrupt within a MMIO register).
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to let subsystems like DSA fully utilize PHYLINK, we need to be able
to communicate phy_device::flags from of_phy_{connect,attach} even when using
PHYLINK APIs.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add pcim_set_mwi(), a device-managed version of pci_set_mwi().
First user is the Realtek r8169 driver.
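Callers use it like pci_set_mwi(), minus the manual cleanup; an
illustrative sketch from a probe path:

	rc = pcim_set_mwi(pdev);
	if (rc)
		dev_warn(&pdev->dev, "MWI not available\n");
	/* devres undoes the setting automatically on driver detach */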
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some PHYs need the refclk to be a continuous clock. Therefore they don't
allow turning it off and on again during operation. Nonetheless such
clock switching is performed by some ETH drivers (namely FEC [1]) for
power saving reasons. An example for an affected PHY is the
SMSC/Microchip LAN8720 in "REF_CLK In Mode".
In order to provide a uniform method to overcome this problem, this patch
adds a new phy_driver flag (PHY_RST_AFTER_CLK_EN) and a corresponding
function phy_reset_after_clk_enable() to the phylib. These should be
used to trigger a reset of the PHY after the refclk is switched on again.
[1] commit e8fcfcd568 ("net: fec: optimize the clock management to save power")
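In rough outline (an illustrative sketch, not any specific driver's
patch), the PHY driver opts in via the flag and the MAC driver calls
the helper after re-enabling the clock:

	/* PHY driver: declare that a reset is needed after clk enable */
	.flags		= PHY_RST_AFTER_CLK_EN,

	/* MAC driver: after switching the refclk back on */
	clk_prepare_enable(priv->clk_ref);	/* priv is illustrative */
	phy_reset_after_clk_enable(ndev->phydev);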
Signed-off-by: Richard Leitner <richard.leitner@skidata.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some PHYs need a minimum time after the reset gpio is asserted and/or
deasserted. To ensure we meet these timing requirements, add two new
optional devicetree parameters for the phy: reset-delay-us and
reset-post-delay-us.
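The PHY/MDIO core can then honour the delays around GPIO reset
toggling roughly as follows (a sketch; the exact parsing location is
assumed):

	u32 delay = 0, post_delay = 0;

	of_property_read_u32(np, "reset-delay-us", &delay);
	of_property_read_u32(np, "reset-post-delay-us", &post_delay);

	gpiod_set_value_cansleep(reset_gpio, 1);	/* assert reset */
	if (delay)
		usleep_range(delay, delay + 100);
	gpiod_set_value_cansleep(reset_gpio, 0);	/* deassert reset */
	if (post_delay)
		usleep_range(post_delay, post_delay + 100);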
Signed-off-by: Richard Leitner <richard.leitner@skidata.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Error injection is sloppy and very ad-hoc. BPF could fill this niche
perfectly with its kprobe functionality. We can make sure errors are
only triggered in the specific call chains and situations that we
care about. Accomplish this with the bpf_override_function
helper. This will modify the probed function's return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function. This gives us a nice
clean way to implement systematic error injection for all of our code
paths.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Using BPF we can override kprobed functions and return arbitrary
values. Obviously this can be a bit unsafe, so make this feature opt-in
for functions. Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
order to give BPF access to that function for error injection purposes.
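Opting in is a one-line annotation next to the function's definition;
a hedged sketch (foo_setup is a made-up example, and the pull request
text above spells the tag BPF_ALLOW_ERROR_INJECTION()):

	static int foo_setup(struct foo_dev *d)
	{
		/* real work that can legitimately fail */
		return 0;
	}
	BPF_ALLOW_ERROR_INJECTION(foo_setup);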
Signed-off-by: Josef Bacik <jbacik@fb.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Commit e87c6bc385 ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs are attached to the same tp interface.
Besides getting visibility into the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites return code
may impact subsequent program results).
Commit 2541517c32 ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.
The new uapi ioctl command:
PERF_EVENT_IOC_QUERY_BPF
The new uapi/linux/perf_event.h structure:
	struct perf_event_query_bpf {
		__u32	ids_len;
		__u32	prog_cnt;
		__u32	ids[0];
	};
User space provides the buffer "ids" for the kernel to copy into.
On return from the kernel, the number of available
programs in the array is set in "prog_cnt".
The usage:
	struct perf_event_query_bpf *query =
		malloc(sizeof(*query) + sizeof(__u32) * ids_len);

	query->ids_len = ids_len;
	err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
	if (err == 0) {
		/* query->prog_cnt is the number of available progs,
		 * number of progs in ids: (ids_len == 0) ? 0 : query->prog_cnt
		 */
	} else if (errno == ENOSPC) {
		/* query->ids_len number of progs copied,
		 * query->prog_cnt is the number of available progs
		 */
	} else {
		/* other errors */
	}
Signed-off-by: Yonghong Song <yhs@fb.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When using large tcp_rmem[2] values (I did tests with 500 MB),
I noticed overflows while computing rcvwin.
Let's fix this before the following patch.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are no longer any kernelspace uses of ACCESS_ONCE(), so we can
remove the definition from <linux/compiler.h>.
This patch removes the ACCESS_ONCE() definition, and updates comments
which referred to it. At the same time, some inconsistent and redundant
whitespace is removed from comments.
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: apw@canonical.com
Link: http://lkml.kernel.org/r/20171127103824.36526-4-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This code (CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y),
while it found a number of old bugs initially, was also causing too many
false positives that caused people to disable lockdep - which is arguably
a worse overall outcome.
If we disable cross-release by default but keep the code upstream then
in practice the most likely outcome is that we'll allow the situation
to degrade gradually, by allowing entropy to introduce more and more
false positives, until it overwhelms maintenance capacity.
Another bad side effect was that people were trying to work around
the false positives by uglifying/complicating unrelated code. There's
a marked difference between annotating locking operations and
uglifying good code just due to bad lock debugging code ...
This gradual decrease in quality happened to a number of debugging
facilities in the kernel, and lockdep is pretty complex already,
so we cannot risk this outcome.
Either cross-release checking can be done right with no false positives,
or it should not be included in the upstream kernel.
( Note that it might make sense to maintain it out of tree and go through
the false positives every now and then and see whether new bugs were
introduced. )
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When CONFIG_GENERIC_LOCKBREAK=y, locking structures grow an extra int ->break_lock
field which is used to implement raw_spin_is_contended() by setting the field
to 1 when waiting on a lock and clearing it to zero when holding a lock.
However, there are a few problems with this approach:
- There is a write-write race between a CPU successfully taking the lock
(and subsequently writing break_lock = 0) and a waiter waiting on
the lock (and subsequently writing break_lock = 1). This could result
in a contended lock being reported as uncontended and vice-versa.
- On machines with store buffers, nothing guarantees that the writes
to break_lock are visible to other CPUs at any particular time.
- READ_ONCE/WRITE_ONCE are not used, so the field is potentially
susceptible to harmful compiler optimisations.
Consequently, the usefulness of this field is unclear and we'd be better off
removing it and allowing architectures to implement raw_spin_is_contended() by
providing a definition of arch_spin_is_contended(), as they can when
CONFIG_GENERIC_LOCKBREAK=n.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1511894539-7988-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
sctp_idatahdr and sctp_idata_chunk are used to define and parse the
I-DATA chunk format, and sctp_make_idata is a function to build
the chunk.
The I-DATA Chunk Format is defined in section 2.1 of RFC8260.
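Sketched from the RFC 8260 wire format, the header looks roughly like
this (see include/linux/sctp.h for the authoritative definition):

	struct sctp_idatahdr {
		__be32 tsn;
		__be16 stream;
		__be16 reserved;
		__be32 mid;
		union {
			__u32 ppid;
			__be32 fsn;
		};
		__u8 payload[0];
	};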
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
asoc intl_enable will be set when the local sp strm_interleave is set
and there's an I-DATA chunk in the init and init_ack extensions, as
described in section 2.2.1 of RFC8260.
asoc intl_enable indicates all data will be sent as I-DATA chunks.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Users of ptr_ring expect that it's safe to give the
data structure a pointer and have it be available
to consumers, but that actually requires an smp_wmb
or a stronger barrier.
In the absence of such barriers and on architectures that reorder
writes, a consumer might read an uninitialized value from an skb
pointer stored in the skb array. This was observed causing crashes.
To fix, add memory barriers. The barrier we use is a wmb, the
assumption being that producers do not need to read the value so we do
not need to order these reads.
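The producer-side fix is essentially one barrier before publishing
the pointer (a sketch of the idea, simplified from the ptr_ring code):

	/* producer */
	if (unlikely(!r->size) || r->queue[r->producer])
		return -ENOSPC;

	/* Make sure the object is fully initialized before the
	 * pointer to it becomes visible to the consumer.
	 */
	smp_wmb();

	r->queue[r->producer++] = ptr;
	if (unlikely(r->producer >= r->size))
		r->producer = 0;
	return 0;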
Reported-by: George Cherian <george.cherian@cavium.com>
Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add two new library functions: alloc_bucket_spinlocks and
free_bucket_spinlocks. These are used to allocate and free an array
of spinlocks that are useful as locks for hash buckets. The interface
specifies the maximum number of spinlocks in the array as well
as a CPU multiplier to derive the number of spinlocks to allocate.
The number allocated is rounded up to a power of two to make the
array amenable to hash lookup.
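The interface is roughly the following (a sketch; see the final
prototypes in lib/):

	int alloc_bucket_spinlocks(spinlock_t **locks,
				   unsigned int *lock_mask,
				   size_t max_size,
				   unsigned int cpu_mult,
				   gfp_t gfp);
	void free_bucket_spinlocks(spinlock_t *locks);

A caller then indexes the returned array with (hash & lock_mask) to
pick a bucket's lock.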
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Split out most of rht_key_hashfn, which calculates the hash, into
its own function. This way the hash function can be called separately to
get the hash value.
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function is like rhashtable_walk_next except that it only returns
the current element in the iter and does not advance the iter.
This patch also creates __rhashtable_walk_find_next. It finds the next
element in the table when the entry cached in iter is NULL or at the end
of a slot. __rhashtable_walk_find_next is called from
rhashtable_walk_next and rhashtable_walk_peek.
end_of_table is a new field in the iter structure. It indicates
that the end of table was reached (walker.tbl being NULL is not a
sufficient condition for end of table).
Signed-off-by: Tom Herbert <tom@quantonium.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most callers of rhashtable_walk_start don't care about a resize event
which is indicated by a return value of -EAGAIN. So calls to
rhashtable_walk_start are wrapped with code to ignore -EAGAIN. Something
like this is common:
	ret = rhashtable_walk_start(rhiter);
	if (ret && ret != -EAGAIN)
		goto out;
Since zero and -EAGAIN are the only possible return values from the
function this check is pointless. The condition never evaluates to true.
This patch changes rhashtable_walk_start to return void. This simplifies
code for the callers that ignore -EAGAIN. For the few cases where the
caller cares about the resize event, particularly where the table can be
walked in multiple parts for netlink or seq file dump, the function
rhashtable_walk_start_check has been added that returns -EAGAIN on a
resize event.
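Callers that do care about resize events switch to the checking
variant (illustrative):

	ret = rhashtable_walk_start_check(iter);
	if (ret == -EAGAIN) {
		/* table resized: the walk may see duplicates */
	}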
Signed-off-by: Tom Herbert <tom@quantonium.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Middle-layer code doing suspend-time optimizations for devices with
the DPM_FLAG_SMART_SUSPEND flag set (currently, the PCI bus type and
the ACPI PM domain) needs to make the core skip ->thaw_early and
->thaw callbacks for those devices in some cases and it sets the
power.direct_complete flag for them for this purpose.
However, it turns out that setting power.direct_complete outside of
the PM core is a bad idea as it triggers an excess invocation of
pm_runtime_enable() in device_resume().
For this reason, provide a helper to clear power.is_late_suspended
and power.is_suspended to be invoked by the middle-layer code in
question instead of setting power.direct_complete and make that code
call the new helper.
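A minimal sketch of such a helper (the exact name is assumed here for
illustration):

	void dev_pm_skip_next_resume_phases(struct device *dev)
	{
		dev->power.is_late_suspended = false;
		dev->power.is_suspended = false;
	}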
Fixes: c4b65157ae (PCI / PM: Take SMART_SUSPEND driver flag into account)
Fixes: 05087360fd (ACPI / PM: Take SMART_SUSPEND driver flag into account)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
ARM:
* A number of issues in the vgic discovered using SMATCH
* An off-by-one bit calculation in our stage-2 base address mask (32-bit and
  64-bit)
* Fixes to single-step debugging instructions that trap for other
reasons such as MMIO aborts
* Printing unavailable hyp mode as error
* Potential spinlock deadlock in the vgic
* Avoid calling vgic vcpu free more than once
* Broken bit calculation for big endian systems
s390:
* SPDX tags
* Fence storage key accesses from problem state
* Make sure that irq_state.flags is not used in the future
x86:
* Intercept port 0x80 accesses to prevent host instability (CVE)
* Use userspace FPU context for guest FPU (mainly an optimization that
fixes a double use of kernel FPU)
* Do not leak one page per module load
* Flush APIC page address cache from MMU invalidation notifiers
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"ARM:
- A number of issues in the vgic discovered using SMATCH
- An off-by-one bit calculation in our stage-2 base address mask (32-bit
  and 64-bit)
- Fixes to single-step debugging instructions that trap for other
reasons such as MMIO aborts
- Printing unavailable hyp mode as error
- Potential spinlock deadlock in the vgic
- Avoid calling vgic vcpu free more than once
- Broken bit calculation for big endian systems
s390:
- SPDX tags
- Fence storage key accesses from problem state
- Make sure that irq_state.flags is not used in the future
x86:
- Intercept port 0x80 accesses to prevent host instability (CVE)
- Use userspace FPU context for guest FPU (mainly an optimization
that fixes a double use of kernel FPU)
- Do not leak one page per module load
- Flush APIC page address cache from MMU invalidation notifiers"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
KVM: x86: fix APIC page invalidation
KVM: s390: Fix skey emulation permission check
KVM: s390: mark irq_state.flags as non-usable
KVM: s390: Remove redundant license text
KVM: s390: add SPDX identifiers to the remaining files
KVM: VMX: fix page leak in hardware_setup()
KVM: VMX: remove I/O port 0x80 bypass on Intel hosts
x86,kvm: remove KVM emulator get_fpu / put_fpu
x86,kvm: move qemu/guest FPU switching out to vcpu_run
KVM: arm/arm64: Fix broken GICH_ELRSR big endian conversion
KVM: arm/arm64: kvm_arch_destroy_vm cleanups
KVM: arm/arm64: Fix spinlock acquisition in vgic_set_owner
kvm: arm: don't treat unavailable HYP mode as an error
KVM: arm/arm64: Avoid attempting to load timer vgic state without a vgic
kvm: arm64: handle single-step of hyp emulated mmio instructions
kvm: arm64: handle single-step during SError exceptions
kvm: arm64: handle single-step of userspace mmio instructions
kvm: arm64: handle single-stepping trapped instructions
KVM: arm/arm64: debug: Introduce helper for single-step
arm: KVM: Fix VTTBR_BADDR_MASK BUG_ON off-by-one
...
Commit 4675ff05de ("kmemcheck: rip it out") has removed the code but
for some reason the SPDX header stayed in place. This looks like a
rebase mistake in the mmotm tree or a merge mistake. Let's drop those
leftovers as well.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) CAN fixes from Martin Kelly (cancel URBs properly in all the CAN usb
drivers).
2) Revert returning -EEXIST from __dev_alloc_name() as this propagates
to userspace and broke some apps. From Johannes Berg.
3) Fix conn memory leaks and crashes in TIPC, from Jon Maloy and Cong
Wang.
4) Gianfar MAC can't do EEE so don't advertise it by default, from
Claudiu Manoil.
5) Relax strict netlink attribute validation, but emit a warning. From
David Ahern.
6) Fix regression in checksum offload of thunderx driver, from Florian
Westphal.
7) Fix UAPI bpf issues on s390, from Hendrik Brueckner.
8) New card support in iwlwifi, from Ihab Zhaika.
9) BBR congestion control bug fixes from Neal Cardwell.
10) Fix port stats in nfp driver, from Pieter Jansen van Vuuren.
11) Fix leaks in qualcomm rmnet, from Subash Abhinov Kasiviswanathan.
12) Fix DMA API handling in sh_eth driver, from Thomas Petazzoni.
13) Fix spurious netpoll warnings in bnxt_en, from Calvin Owens.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
net: mvpp2: fix the RSS table entry offset
tcp: evaluate packet losses upon RTT change
tcp: fix off-by-one bug in RACK
tcp: always evaluate losses in RACK upon undo
tcp: correctly test congestion state in RACK
bnxt_en: Fix sources of spurious netpoll warnings
tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
tcp_bbr: reset full pipe detection on loss recovery undo
tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
sfc: pass valid pointers from efx_enqueue_unwind
gianfar: Disable EEE autoneg by default
tcp: invalidate rate samples during SACK reneging
can: peak/pcie_fd: fix potential bug in restarting tx queue
can: usb_8dev: cancel urb on -EPIPE and -EPROTO
can: kvaser_usb: cancel urb on -EPIPE and -EPROTO
can: esd_usb2: cancel urb on -EPIPE and -EPROTO
can: ems_usb: cancel urb on -EPIPE and -EPROTO
can: mcba_usb: cancel urb on -EPROTO
usbnet: fix alignment for frames with no ethernet header
tcp: use current time in tcp_rcv_space_adjust()
...
Move 'macaddr_count' from after 'netpoll' to after 'nest_level' to pack
and reduce a memory hole.
Fixes: 88ca59d1aa (macvlan: remove unused fields in struct macvlan_dev)
Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ALR table operations are a sequence of related register operations which
should be protected from concurrent access. The alr_cache should also be
protected. Add alr_mutex to do that.
Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>