Commit graph

49711 commits

Author SHA1 Message Date
Amir Vadai
43a335e055 net/mlx5_core: Flow counters infrastructure
If a counter has the aging flag set when created, it is added to a list
of counters that will be queried periodically from a workqueue.  query
result and last use timestamp are cached.
add/del counter must be very efficient since thousands of such
operations might be issued in a second.
There is only a single reference to counters without aging, therefore
no need for locks.
But, counters with aging enabled are stored in a list. In order to make
code as lockless as possible, all the list manipulation and access to
hardware is done from a single context - the periodic counters query
thread.

The hardware supports multiple counters per FTE, however currently we
are using one counter for each FTE.

Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16 13:43:51 -04:00
Amir Vadai
bd5251dbf1 net/mlx5_core: Introduce flow steering destination of type counter
When adding a flow steering rule with a counter, need to supply a
destination of type MLX5_FLOW_DESTINATION_TYPE_COUNTER, with a pointer
to a struct mlx5_fc.
Also, MLX5_FLOW_CONTEXT_ACTION_COUNT bit should be set in the action.

Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16 13:43:51 -04:00
Amir Vadai
9dc0b289c4 net/mlx5_core: Firmware commands to support flow counters
Getting packet/byte statistics on flows is done through flow counters.
Implement the firmware commands to alloc, free and query flow counters.

Signed-off-by: Amir Vadai <amirva@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16 13:43:51 -04:00
David S. Miller
909b27f706 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
because we no longer have a per-netns conntrack hash.

The ip_gre.c conflict as well as the iwlwifi ones were cases of
overlapping changes.

Conflicts:
	drivers/net/wireless/intel/iwlwifi/mvm/tx.c
	net/ipv4/ip_gre.c
	net/netfilter/nf_conntrack_core.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-15 13:32:48 -04:00
Linus Torvalds
6ba5b85fd4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Overlayfs fixes from Miklos, assorted fixes from me.

  Stable fodder of varying severity, all sat in -next for a while"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ovl: ignore permissions on underlying lookup
  vfs: add lookup_hash() helper
  vfs: rename: check backing inode being equal
  vfs: add vfs_select_inode() helper
  get_rock_ridge_filename(): handle malformed NM entries
  ecryptfs: fix handling of directory opening
  atomic_open(): fix the handling of create_error
  fix the copy vs. map logics in blk_rq_map_user_iov()
  do_splice_to(): cap the size before passing to ->splice_read()
2016-05-14 11:59:43 -07:00
Linus Torvalds
1410b74e40 Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "During v4.6-rc1 cgroup namespace support was merged.  There is an
  issue where it's impossible to tell whether a given cgroup mount point
  is bind mounted or namespaced.  Serge has been working on the issue
  but it took longer than expected to resolve, so the late pull request.

  Given that it's a completely new feature and the patches don't touch
  anything else, the risk seems acceptable.  However, if this is too
  late, an alternative is plugging new cgroup ns creation for v4.6 and
  retrying for v4.7"

* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: fix compile warning
  kernfs: kernfs_sop_show_path: don't return 0 after seq_dentry call
  cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
  kernfs_path_from_node_locked: don't overwrite nlen
2016-05-13 16:26:46 -07:00
Linus Torvalds
c3548b730d regulator: Fixes for v4.6
A small collection of driver specific fixes for the regulator
 subsysetem:
 
  - Fix handling of probe deferral for GPIO regulators.
  - Fix a typo in the module alias for DA9053.
  - Fix the definition of BUCK9 in the S2MPS11 driver.  This change looks
    larger than it is because an irregularity in the hardware means that
    the macro used to define bucks 6-10 needs duplicating and tweaking
    to have a separate macro for 9.
  - Fix a series of errors in the definitions of the LDOs the AXP20x
    regulators, some of which had always been present and some of which
    were introduced in the merge window.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXNazxAAoJECTWi3JdVIfQd9EH/0x/3Era4ctuLFP/dKVg+lAQ
 PDwgGRlHg1RVyEzzy1ZWqew//P7hgBo6urwFHjcczBE95e0bBhhMJERGnhya07uq
 eWYfPARl1zHxEm0gOfLBDaFAW8YIaNyXiEsW/aTeViNIL9/HIAd/ZTZhJiDLPiAF
 FdNs3yzBuAR5SHsnvCTXZY1d20LeQC6AC2nWUo2WrcyKKZZ0HamrSLP0lMq0OHVQ
 n2h/8f1g9TcykakCStB/4D8vLKMt0OYmalh3Y/C/JGkWnhMoBTLRgktrNFh6lnh9
 E0SnMXSuZbC9+mv8MLRHQft6XagDYzNq+odwSb9tfqbXqBdncOpvvVr6hLmzxGk=
 =sVmG
 -----END PGP SIGNATURE-----

Merge tag 'regulator-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator

Pull regulator fixes from Mark Brown:
 "A small collection of driver specific fixes for the regulator
  subsysetem:

   - Fix handling of probe deferral for GPIO regulators

   - Fix a typo in the module alias for DA9053

   - Fix the definition of BUCK9 in the S2MPS11 driver.  This change
     looks larger than it is because an irregularity in the hardware
     means that the macro used to define bucks 6-10 needs duplicating
     and tweaking to have a separate macro for 9

   - Fix a series of errors in the definitions of the LDOs the AXP20x
     regulators, some of which had always been present and some of which
     were introduced in the merge window"

* tag 'regulator-fix-v4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: da9063: Correct module alias prefix to fix module autoloading
  regulator: axp20x: Fix axp22x ldo_io registration error on cold boot
  regulator: axp20x: Fix axp22x ldo_io voltage ranges
  regulator: axp20x: Fix LDO4 linear voltage range
  regulator: s2mps11: Fix invalid selector mask and voltages for buck9
  regulator: gpio: check return value of of_get_named_gpio
2016-05-13 09:46:00 -07:00
Mark Brown
9689dab30a Merge remote-tracking branches 'regulator/fix/axp20x', 'regulator/fix/da9063', 'regulator/fix/gpio' and 'regulator/fix/s2mps11' into regulator-linus 2016-05-13 11:11:08 +01:00
Andrea Arcangeli
6d0a07edd1 mm: thp: calculate the mapcount correctly for THP pages during WP faults
This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.

total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the full accurate return value for page_mapcount()
if dealing with Transparent Hugepages, however we only use the
page_trans_huge_mapcount() during COW faults where it strictly needed,
due to its higher runtime cost.

This also provide at practical zero cost the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.

Kirill A. Shutemov discovered the problem with moving the page
anon_vma to the local vma->anon_vma in a previous version of this
patch and another problem in the way page_move_anon_rmap() was called.

Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
previous version, because reuse_swap_page must be a macro to call
page_trans_huge_mapcount from swap.h, so this uses a macro again
instead of an inline function. With this change at least it's a less
dangerous usage than it was before, because "page" is used only once
now, while with the previous code reuse_swap_page(page++) would have
called page_mapcount on page+1 and it would have increased page twice
instead of just once.

Dean Luick noticed an uninitialized variable that could result in a
rmap inefficiency for the non-THP case in a previous version.

Mike Marciniszyn said:

: Our RDMA tests are seeing an issue with memory locking that bisects to
: commit 61f5d698cc ("mm: re-enable THP")
:
: The test program registers two rather large MRs (512M) and RDMA
: writes data to a passive peer using the first and RDMA reads it back
: into the second MR and compares that data.  The sizes are chosen randomly
: between 0 and 1024 bytes.
:
: The test will get through a few (<= 4 iterations) and then gets a
: compare error.
:
: Tracing indicates the kernel logical addresses associated with the individual
: pages at registration ARE correct , the data in the "RDMA read response only"
: packets ARE correct.
:
: The "corruption" occurs when the packet crosse two pages that are not physically
: contiguous.   The second page reads back as zero in the program.
:
: It looks like the user VA at the point of the compare error no longer points to
: the same physical address as was registered.
:
: This patch totally resolves the issue!

Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reviewed-by: Dean Luick <dean.luick@intel.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Tested-by: Josh Collier <josh.d.collier@intel.com>
Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
Cc: <stable@vger.kernel.org>	[4.5]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-12 15:52:50 -07:00
Yuval Mintz
831bfb0e88 qed*: Tx-switching configuration
Device should be configured by default to VEB once VFs are active.
This changes the configuration of both PFs' and VFs' vports into enabling
tx-switching once sriov is enabled.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 00:04:08 -04:00
Yuval Mintz
73390ac9d8 qed*: support ndo_get_vf_config
Allows the user to view the VF configuration by observing the PF's
device.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 00:04:08 -04:00
Yuval Mintz
6ddc760825 qed*: IOV support spoof-checking
Add support in `ndo_set_vf_spoofchk' for allowing PF control over
its VF spoof-checking configuration.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 00:04:08 -04:00
Yuval Mintz
733def6a04 qed*: IOV link control
This adds support in 2 ndo that allow PF to tweak the VF's view of the
link - `ndo_set_vf_link_state' to allow it a view independent of the PF's,
and `ndo_set_vf_rate' which would allow the PF to limit the VF speed.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 00:04:08 -04:00
Yuval Mintz
eff169608c qed*: Support forced MAC
Allows the PF to enforce the VF's mac.
i.e., by using `ip link ... vf <x> mac <value>'.

While a MAC is forced, PF would prevent the VF from configuring any other
MAC.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 00:04:08 -04:00
Yuval Mintz
08feecd7fc qed*: Support PVID configuration
This adds support for PF control over the VF vlan configuration.
I.e., `ip link ... vf <x> vlan <vid>' should now be supported.

 1. <vid> != 0 => VF receives [unknowingly] only traffic tagged by
    <vid> and tags all outgoing traffic sent by VF with <vid>.
 2. <vid> == 0 ==> Remove the pvid configuration, reverting to previous.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 00:04:07 -04:00
Yuval Mintz
0b55e27d56 qed: IOV configure and FLR
While previous patches have already added the necessary logic to probe
VFs as well as enabling them in the HW, this patch adds the ability to
support VF FLR & SRIOV disable.

It then wraps both flows together into the first IOV callback to be
provided to the protocol driver - `configure'. This would later to be used
to enable and disable SRIOV in the adapter.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 00:04:07 -04:00
Yuval Mintz
1408cc1fa4 qed: Introduce VFs
This adds the qed VFs for the first time -
The vfs are limited functions, with a very different PCI bar structure
[when compared with PFs] to better impose the related security demands
associated with them.

This patch includes the logic neccesary to allow VFs to successfully probe
[without actually adding the ability to enable iov].
This includes diverging all the flows that would occur as part of the pci
probe of the driver, preventing VF from accessing registers/memories it
can't and instead utilize the VF->PF channel to query the PF for needed
information.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 00:04:07 -04:00
Yuval Mintz
37bff2b9c6 qed: Add VF->PF channel infrastructure
Communication between VF and PF is based on a dedicated HW channel;
VF will prepare a messge, and by signaling the HW the PF would get a
notification of that message existance. The PF would then copy the
message, process it and DMA an answer back to the VF as a response.

The messages themselves are TLV-based - allowing easier backward/forward
compatibility.

This patch adds the infrastructure of the channel on the PF side -
starting with the arrival of the notification and ending with DMAing
the response back to the VF.

It also adds a dummy-response as reference, as it only lays the
groundwork of the communication; it doesn't really add support of any
actual messages.

Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-12 00:04:07 -04:00
Tariq Toukan
7219ab34f1 net/mlx5e: CQE compression
CQE compression feature is meant to save PCIe bandwidth by
compressing few CQEs into smaller amount of bytes on PCIe.
CQE compression can be selectively enabled per CQ.  By default
is disabled for now and will be enabled later on.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11 19:42:39 -04:00
David Ahern
74b20582ac net: l3mdev: Add hook in ip and ipv6
Currently the VRF driver uses the rx_handler to switch the skb device
to the VRF device. Switching the dev prior to the ip / ipv6 layer
means the VRF driver has to duplicate IP/IPv6 processing which adds
overhead and makes features such as retaining the ingress device index
more complicated than necessary.

This patch moves the hook to the L3 layer just after the first NF_HOOK
for PRE_ROUTING. This location makes exposing the original ingress device
trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
in the future.

dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
with the switched device through the packet taps to maintain current
behavior (tcpdump can be used on either the vrf device or the enslaved
devices).

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11 19:31:40 -04:00
Al Viro
e4d35be584 Merge branch 'ovl-fixes' into for-linus 2016-05-11 00:00:29 -04:00
Miklos Szeredi
3c9fe8cdff vfs: add lookup_hash() helper
Overlayfs needs lookup without inode_permission() and already has the name
hash (in form of dentry->d_name on overlayfs dentry).  It also doesn't
support filesystems with d_op->d_hash() so basically it only needs
the actual hashed lookup from lookup_one_len_unlocked()

So add a new helper that does unlocked lookup of a hashed name.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-05-10 23:56:28 -04:00
Miklos Szeredi
54d5ca871e vfs: add vfs_select_inode() helper
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
2016-05-10 23:55:01 -04:00
Nicolas Dichtel
1dee3f59a8 block/drbd: align properly u64 in nl messages
The attribute 0 is never used in drbd, so let's use it as pad attribute
in netlink messages. This minimizes the patch.

Note that this patch is only compile-tested.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10 15:43:09 -04:00
Philippe Reynes
9d9a77cee1 net: phy: add phy_ethtool_{get|set}_link_ksettings
Ethtool callbacks {get|set}_link_ksettings are often the same, so
we add two generics functions phy_ethtool_{get|set}_link_ksettings
to avoid writing severals times the same function.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Acked-By: David Decotigny <decot@googlers.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10 15:06:19 -04:00
David S. Miller
7a27de7810 linux-can-next-for-4.7-20160509
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCgAGBQJXMGAcAAoJED07qiWsqSVqR7AH/RTuW5SeDFQGI1YK4U6ekrbg
 +22EDLyUh+MD/eBKf74C9jciaTnd84PAYCOEBa6rXi/2P1gHMnyEIJOxse/cfgKz
 Hf26avGjaTCPS7VFHJeLTSrOlR/Hogl5gp+SEjA4WD1cpr480lS3sgGjax8YTY20
 sNl2xJqnFVjkJAa0f7AsmaZRHsyytvPbS5c8z7RuihhX1yamTPm8BKqY7s4oJ83n
 Rg2/fXV6O1Dg+p/2qra7kyMGj6wIIXOI9wXPjLNXuR6nqT3vWhGaKy+pkl/Ok2JY
 UvwDeb7UvgXcypv5FO3LW9R7vqF5L9ZpqS2XCrlTwoFct7bCOCH1xJFGaXV/Cbo=
 =Eipf
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-4.7-20160509' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2016-05-09

this is a pull request of 12 patches for net-next/master.

Alexander Gerasiov and Nikita Edward Baruzdin each contribute a patch
improving the sja1000 driver. Amitoj Kaur Chawla's patch converts the
mcp251x driver to alloc_workqueue(). A patch by Oliver Hartkopp fixes
the handling of CAN config options. Andreas Gröger improves the error
handling in the janz-ican3 driver. The patch by Maximilian Schneider
for the gs_usb improves probing of the USB driver. Finally there are 6
improvement patches by Marek Vasut for the ifi CAN driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10 14:19:06 -04:00
David S. Miller
e800072c18 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
In netdevice.h we removed the structure in net-next that is being
changes in 'net'.  In macsec.c and rtnetlink.c we have overlaps
between fixes in 'net' and the u64 attribute changes in 'net-next'.

The mlx5 conflicts have to do with vxlan support dependencies.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09 15:59:24 -04:00
Linus Torvalds
26acc792c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Check klogctl failure correctly, from Colin Ian King.

 2) Prevent OOM when under memory pressure in flowcache, from Steffen
    Klassert.

 3) Fix info leak in llc and rtnetlink ifmap code, from Kangjie Lu.

 4) Memory barrier and multicast handling fixes in bnxt_en, from Michael
    Chan.

 5) Endianness bug in mlx5, from Daniel Jurgens.

 6) Fix disconnect handling in VSOCK, from Ian Campbell.

 7) Fix locking of netdev list walking in get_bridge_ifindices(), from
    Nikolay Aleksandrov.

 8) Bridge multicast MLD parser can look at wrong packet offsets, fix
    from Linus Lüssing.

 9) Fix chip hang in qede driver, from Sudarsana Reddy Kalluru.

10) Fix missing setting of encapsulation before inner handling completes
    in udp_offload code, from Jarno Rajahalme.

11) Missing rollbacks during LAG join and flood configuration failures
    in mlxsw driver, from Ido Schimmel.

12) Fix error code checks in netxen driver, from Dan Carpenter.

13) Fix key size in new macsec driver, from Sabrina Dubroca.

14) Fix mlx5/VXLAN dependencies, from Arnd Bergmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (29 commits)
  net/mlx5e: make VXLAN support conditional
  Revert "net/mlx5: Kconfig: Fix MLX5_EN/VXLAN build issue"
  macsec: key identifier is 128 bits, not 64
  Documentation/networking: more accurate LCO explanation
  macvtap: segmented packet is consumed
  tools: bpf_jit_disasm: check for klogctl failure
  qede: uninitialized variable in qede_start_xmit()
  netxen: netxen_rom_fast_read() doesn't return -1
  netxen: reversed condition in netxen_nic_set_link_parameters()
  netxen: fix error handling in netxen_get_flash_block()
  mlxsw: spectrum: Add missing rollback in flood configuration
  mlxsw: spectrum: Fix rollback order in LAG join failure
  udp_offload: Set encapsulation before inner completes.
  udp_tunnel: Remove redundant udp_tunnel_gro_complete().
  qede: prevent chip hang when increasing channels
  net: ipv6: tcp reset, icmp need to consider L3 domain
  bridge: fix igmp / mld query parsing
  net: bridge: fix old ioctl unlocked net device walk
  VSOCK: do not disconnect socket when peer has shutdown SEND only
  net/mlx4_en: Fix endianness bug in IPV6 csum calculation
  ...
2016-05-09 12:11:37 -07:00
David S. Miller
e8ed77dfa9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following large patchset contains Netfilter updates for your
net-next tree. My initial intention was to send you this in two goes but
when I looked back twice I already had this burden on top of me.

Several updates for IPVS from Marco Angaroni:

1) Allow SIP connections originating from real-servers to be load
   balanced by the SIP persistence engine as is already implemented
   in the other direction.

2) Release connections immediately for One-packet-scheduling (OPS)
   in IPVS, instead of making it via timer and rcu callback.

3) Skip deleting conntracks for each one packet in OPS, and don't call
   nf_conntrack_alter_reply() since no reply is expected.

4) Enable drop on exhaustion for OPS + SIP persistence.

Miscelaneous conntrack updates from Florian Westphal, including fix for
hash resize:

5) Move conntrack generation counter out of conntrack pernet structure
   since this is only used by the init_ns to allow hash resizing.

6) Use get_random_once() from packet path to collect hash random seed
    instead of our compound.

7) Don't disable BH from ____nf_conntrack_find() for statistics,
   use NF_CT_STAT_INC_ATOMIC() instead.

8) Fix lookup race during conntrack hash resizing.

9) Introduce clash resolution on conntrack insertion for connectionless
   protocol.

Then, Florian's netns rework to get rid of per-netns conntrack table,
thus we use one single table for them all. There was consensus on this
change during the NFWS 2015 and, on top of that, it has recently been
pointed as a source of multiple problems from unpriviledged netns:

11) Use a single conntrack hashtable for all namespaces. Include netns
    in object comparisons and make it part of the hash calculation.
    Adapt early_drop() to consider netns.

12) Use single expectation and NAT hashtable for all namespaces.

13) Use a single slab cache for all namespaces for conntrack objects.

14) Skip full table scanning from nf_ct_iterate_cleanup() if the pernet
    conntrack counter tells us the table is empty (ie. equals zero).

Fixes for nf_tables interval set element handling, support to set
conntrack connlabels and allow set names up to 32 bytes.

15) Parse element flags from element deletion path and pass it up to the
    backend set implementation.

16) Allow adjacent intervals in the rbtree set type for dynamic interval
    updates.

17) Add support to set connlabel from nf_tables, from Florian Westphal.

18) Allow set names up to 32 bytes in nf_tables.

Several x_tables fixes and updates:

19) Fix incorrect use of IS_ERR_VALUE() in x_tables, original patch
    from Andrzej Hajda.

And finally, miscelaneous netfilter updates such as:

20) Disable automatic helper assignment by default. Note this proc knob
    was introduced by a900689264 ("netfilter: nf_ct_helper: allow to
    disable automatic helper assignment") 4 years ago to start moving
    towards explicit conntrack helper configuration via iptables CT
    target.

21) Get rid of obsolete and inconsistent debugging instrumentation
    in x_tables.

22) Remove unnecessary check for null after ip6_route_output().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09 15:02:58 -04:00
Josh Poimboeuf
8634de6d25 compiler-gcc: require gcc 4.8 for powerpc __builtin_bswap16()
gcc support for __builtin_bswap16() was supposedly added for powerpc in
gcc 4.6, and was then later added for other architectures in gcc 4.8.

However, Stephen Rothwell reported that attempting to use it on powerpc
in gcc 4.6 fails with:

  lib/vsprintf.c:160:2: error: initializer element is not constant
  lib/vsprintf.c:160:2: error: (near initialization for 'decpair[0]')
  lib/vsprintf.c:160:2: error: initializer element is not constant
  lib/vsprintf.c:160:2: error: (near initialization for 'decpair[1]')
  ...

I'm not entirely sure what those errors mean, but I don't see them on
gcc 4.8.  So let's consider gcc 4.8 to be the official starting point
for __builtin_bswap16().

Arnd Bergmann adds:
 "I found the commit in gcc-4.8 that replaced the powerpc-specific
  implementation of __builtin_bswap16 with an architecture-independent
  one.  Apparently the powerpc version (gcc-4.6 and 4.7) just mapped to
  the lhbrx/sthbrx instructions, so it ended up not being a constant,
  though the intent of the patch was mainly to add support for the
  builtin to x86:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52624

  has the patch that went into gcc-4.8 and more information."

Fixes: 7322dd755e ("byteswap: try to avoid __builtin_constant_p gcc bug")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-09 11:54:29 -07:00
Serge E. Hallyn
4f41fc5962 cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces
Patch summary:

When showing a cgroupfs entry in mountinfo, show the path of the mount
root dentry relative to the reader's cgroup namespace root.

Short explanation (courtesy of mkerrisk):

If we create a new cgroup namespace, then we want both /proc/self/cgroup
and /proc/self/mountinfo to show cgroup paths that are correctly
virtualized with respect to the cgroup mount point.  Previous to this
patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
does not.

Long version:

When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
namespace, and then mounts a new instance of the freezer cgroup, the new
mount will be rooted at /a/b.  The root dentry field of the mountinfo
entry will show '/a/b'.

 cat > /tmp/do1 << EOF
 mount -t cgroup -o freezer freezer /mnt
 grep freezer /proc/self/mountinfo
 EOF

 unshare -Gm  bash /tmp/do1
 > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
 > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer

The task's freezer cgroup entry in /proc/self/cgroup will simply show
'/':

 grep freezer /proc/self/cgroup
 9:freezer:/

If instead the same task simply bind mounts the /a/b cgroup directory,
the resulting mountinfo entry will again show /a/b for the dentry root.
However in this case the task will find its own cgroup at /mnt/a/b,
not at /mnt:

 mount --bind /sys/fs/cgroup/freezer/a/b /mnt
 130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer

In other words, there is no way for the task to know, based on what is
in mountinfo, which cgroup directory is its own.

Example (by mkerrisk):

First, a little script to save some typing and verbiage:

echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
cat /proc/self/mountinfo | grep freezer |
        awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'

Create cgroup, place this shell into the cgroup, and look at the state
of the /proc files:

2653
2653                         # Our shell
14254                        # cat(1)
        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer

Create a shell in new cgroup and mount namespaces. The act of creating
a new cgroup namespace causes the process's current cgroups directories
to become its cgroup root directories. (Here, I'm using my own version
of the "unshare" utility, which takes the same options as the util-linux
version):

Look at the state of the /proc files:

        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /sys/fs/cgroup/freezer

The third entry in /proc/self/cgroup (the pathname of the cgroup inside
the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
is rooted at /a/b in the outer namespace.

However, the info in /proc/self/mountinfo is not for this cgroup
namespace, since we are seeing a duplicate of the mount from the
old mount namespace, and the info there does not correspond to the
new cgroup namespace. However, trying to create a new mount still
doesn't show us the right information in mountinfo:

                                      # propagating to other mountns
        /proc/self/cgroup:      7:freezer:/
        mountinfo:              /a/b    /mnt/freezer

The act of creating a new cgroup namespace caused the process's
current freezer directory, "/a/b", to become its cgroup freezer root
directory. In other words, the pathname directory of the directory
within the newly mounted cgroup filesystem should be "/",
but mountinfo wrongly shows us "/a/b". The consequence of this is
that the process in the cgroup namespace cannot correctly construct
the pathname of its cgroup root directory from the information in
/proc/PID/mountinfo.

With this patch, the dentry root field in mountinfo is shown relative
to the reader's cgroup namespace.  So the same steps as above:

        /proc/self/cgroup:      10:freezer:/a/b
        mountinfo:              /       /sys/fs/cgroup/freezer
        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /../..  /sys/fs/cgroup/freezer
        /proc/self/cgroup:      10:freezer:/
        mountinfo:              /       /mnt/freezer

cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
cgroup.procs           freezer.self_freezing    notify_on_release
3164
2653                   # First shell that placed in this cgroup
3164                   # Shell started by 'unshare'
14197                  # cat(1)

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
Tested-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-05-09 12:15:03 -04:00
Oliver Hartkopp
bb208f144c can: fix handling of unmodifiable configuration options
As described in 'can: m_can: tag current CAN FD controllers as non-ISO'
(6cfda7fbeb) it is possible to define fixed configuration options by
setting the according bit in 'ctrlmode' and clear it in 'ctrlmode_supported'.
This leads to the incovenience that the fixed configuration bits can not be
passed by netlink even when they have the correct values (e.g. non-ISO, FD).

This patch fixes that issue and not only allows fixed set bit values to be set
again but now requires(!) to provide these fixed values at configuration time.
A valid CAN FD configuration consists of a nominal/arbitration bittiming, a
data bittiming and a control mode with CAN_CTRLMODE_FD set - which is now
enforced by a new can_validate() function. This fix additionally removed the
inconsistency that was prohibiting the support of 'CANFD-only' controller
drivers, like the RCar CAN FD.

For this reason a new helper can_set_static_ctrlmode() has been introduced to
provide a proper interface to handle static enabled CAN controller options.

Reported-by: Ramesh Shanmugasundaram <ramesh.shanmugasundaram@bp.renesas.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Reviewed-by: Ramesh Shanmugasundaram  <ramesh.shanmugasundaram@bp.renesas.com>
Cc: <stable@vger.kernel.org> # >= 3.18
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2016-05-09 11:07:28 +02:00
Courtney Cavin
bdabad3e36 net: Add Qualcomm IPC router
Add an implementation of Qualcomm's IPC router protocol, used to
communicate with service providing remote processors.

Signed-off-by: Courtney Cavin <courtney.cavin@sonymobile.com>
Signed-off-by: Bjorn Andersson <bjorn.andersson@sonymobile.com>
[bjorn: Cope with 0 being a valid node id and implement RTM_NEWADDR]
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-08 23:46:14 -04:00
Bjorn Andersson
43315f31ad soc: qcom: smd: Introduce compile stubs
Introduce compile stubs for the SMD API, allowing consumers to be
compile tested.

Acked-by: Andy Gross <andy.gross@linaro.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-08 23:46:14 -04:00
Jarno Rajahalme
229740c631 udp_offload: Set encapsulation before inner completes.
UDP tunnel segmentation code relies on the inner offsets being set for
an UDP tunnel GSO packet, but the inner *_complete() functions will
set the inner offsets only if 'encapsulation' is set before calling
them.  Currently, udp_gro_complete() sets 'encapsulation' only after
the inner *_complete() functions are done.  This causes the inner
offsets having invalid values after udp_gro_complete() returns, which
in turn will make it impossible to properly segment the packet in case
it needs to be forwarded, which would be visible to the user either as
invalid packets being sent or as packet loss.

This patch fixes this by setting skb's 'encapsulation' in
udp_gro_complete() before calling into the inner complete functions,
and by making each possible UDP tunnel gro_complete() callback set the
inner_mac_header to the beginning of the tunnel payload.

Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Reviewed-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06 18:25:26 -04:00
Alexei Starovoitov
db58ba4592 bpf: wire in data and data_end for cls_act_bpf
allow cls_bpf and act_bpf programs access skb->data and skb->data_end pointers.
The bpf helpers that change skb->data need to update data_end pointer as well.
The verifier checks that programs always reload data, data_end pointers
after calls to such bpf helpers.
We cannot add 'data_end' pointer to struct qdisc_skb_cb directly,
since it's embedded as-is by infiniband ipoib, so wrapper struct is needed.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-06 16:01:54 -04:00
Linus Torvalds
01ec716761 Power management and ACPI fixes for v4.6-rc7
- Fix for a recent regression in the intel_pstate driver causing
    it to fail to restore the HWP (HW-managed P-states) configuration
    of the boot CPU after suspend-to-RAM (Rafael Wysocki).
 
  - Fix for two recent regressions in the intel_pstate driver, one
    that can trigger a divide by zero if the driver is accessed via
    sysfs before it manages to take the first sample and one causing
    it to fail to update a structure field used in a trace point, so
    the information coming from it is less useful (Rafael Wysocki).
 
  - Fix for a problem in the sti-cpufreq driver introduced during
    the 4.5 cycle that causes it to break CPU PM in multi-platform
    kernels by registering cpufreq-dt (which subsequently doesn't
    work) unconditionally and preventing the driver that would
    actually work from registering (Sudeep Holla).
 
  - Stable-candidate fix for an ARM64 cpuidle issue causing idle
    state usage counters to be incorrectly updated for idle states
    that were not entered due to errors (James Morse).
 
  - Fix for a recently introduced issue in the OPP (Operating
    Performance Points) framework causing it to print bogus error
    messages for missing optional regulators (Viresh Kumar).
 
  - Fix for a recently introduced issue in the generic device
    properties framework that may cause it to attempt to dereferece
    and invalid pointer in some cases (Heikki Krogerus).
 
  - Fix for a deadlock in the ACPICA core that may be triggered
    by device (eg. Thunderbolt) hotplug (Prarit Bhargava).
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJXLIwPAAoJEILEb/54YlRxT+wP/ROEo/r5IaRZ2k8cphWjsiKk
 k9eDuWBL2KZ29ikghXs/vVY2fMbtQkaDT5h57imsUEKoEzI3MlYA3OkQyffFOcsY
 dz/9EnG6K9Efi6VS1dS1tNCgl45aIeHLCqlVPOBCZ9TwSoAERdNJGqItJdS2YKIA
 +C1LGrWl4UiJ95AOof9PHfKfnWxrnRbpIsB2PbxD0Swe5vfskrHoRWGOAMLJIwpF
 7NvEJ15fryDIvlMR/ggNrg2L2piOu1fJl2kVZYWZTb/u+qAO3utxTQN4y++zTSNb
 LAN78Hq/nJu156SSioO9fLa0wPaU+k2OChfWXtlMsTDK+L5EQz4G3pJwi5FA8QTD
 nfeZNC9VgqfP4LtqWw05h/AOw4A0XUeuwB8Edbc+WG5twzULqDhS57jew4A4xX8d
 jOsvK5syygnR+/rExWc0NWSmCH0g1u6mCUWXQuocfSb/oOEcUGq5RSixRNRfmJUq
 9XNF3hbp7W/Vnp9GWT30Md+CenrEtQXFK8ZQtg0ckBl+b5bEqKYs6FXGqCkUmjZy
 Qgt5sqxgdLWtslS3vSu1/mdryeaLmXNO6c6wueSPMmLyYODEoIHSSka9N9O0Inwv
 d106p7gUy3/ETamC3lbnyHkUrAru74Qh8rErKpqaRLkKfcIq7YCB073fxbqlamzz
 X4n8a1H37LefLqmKwIbF
 =pU+A
 -----END PGP SIGNATURE-----

Merge tag 'pm+acpi-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI fixes from Rafael Wysocki:
 "Fixes for problems introduced or discovered recently (intel_pstate,
  sti-cpufreq, ARM64 cpuidle, Operating Performance Points framework,
  generic device properties framework) and one fix for a hotplug-related
  deadlock in ACPICA that's been there forever, but is nasty enough.

  Specifics:

   - Fix for a recent regression in the intel_pstate driver causing it
     to fail to restore the HWP (HW-managed P-states) configuration of
     the boot CPU after suspend-to-RAM (Rafael Wysocki).

   - Fix for two recent regressions in the intel_pstate driver, one that
     can trigger a divide by zero if the driver is accessed via sysfs
     before it manages to take the first sample and one causing it to
     fail to update a structure field used in a trace point, so the
     information coming from it is less useful (Rafael Wysocki).

   - Fix for a problem in the sti-cpufreq driver introduced during the
     4.5 cycle that causes it to break CPU PM in multi-platform kernels
     by registering cpufreq-dt (which subsequently doesn't work)
     unconditionally and preventing the driver that would actually work
     from registering (Sudeep Holla).

   - Stable-candidate fix for an ARM64 cpuidle issue causing idle state
     usage counters to be incorrectly updated for idle states that were
     not entered due to errors (James Morse).

   - Fix for a recently introduced issue in the OPP (Operating
     Performance Points) framework causing it to print bogus error
     messages for missing optional regulators (Viresh Kumar).

   - Fix for a recently introduced issue in the generic device
     properties framework that may cause it to attempt to dereferece and
     invalid pointer in some cases (Heikki Krogerus).

   - Fix for a deadlock in the ACPICA core that may be triggered by
     device (eg Thunderbolt) hotplug (Prarit Bhargava)"

* tag 'pm+acpi-4.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / OPP: Remove useless check
  ACPICA: Dispatcher: Update thread ID for recursive method calls
  intel_pstate: Fix intel_pstate_get()
  cpufreq: intel_pstate: Fix HWP on boot CPU after system resume
  cpufreq: st: enable selective initialization based on the platform
  ARM: cpuidle: Pass on arm_cpuidle_suspend()'s return value
  device property: Avoid potential dereferences of invalid pointers
2016-05-06 11:58:45 -07:00
Rafael J. Wysocki
7c21b38ca9 Merge branches 'acpica-fixes' and 'device-properties-fixes'
* acpica-fixes:
  ACPICA: Dispatcher: Update thread ID for recursive method calls

* device-properties-fixes:
  device property: Avoid potential dereferences of invalid pointers
2016-05-06 13:15:52 +02:00
Haggai Abramovsky
73898db043 net/mlx4: Avoid wrong virtual mappings
The dma_alloc_coherent() function returns a virtual address which can
be used for coherent access to the underlying memory.  On some
architectures, like arm64, undefined behavior results if this memory is
also accessed via virtual mappings that are not coherent.  Because of
their undefined nature, operations like virt_to_page() return garbage
when passed virtual addresses obtained from dma_alloc_coherent().  Any
subsequent mappings via vmap() of the garbage page values are unusable
and result in bad things like bus errors (synchronous aborts in ARM64
speak).

The mlx4 driver contains code that does the equivalent of:
vmap(virt_to_page(dma_alloc_coherent)), this results in an OOPs when the
device is opened.

Prevent Ethernet driver to run this problematic code by forcing it to
allocate contiguous memory. As for the Infiniband driver, at first we
are trying to allocate contiguous memory, but in case of failure roll
back to work with fragmented memory.

Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com>
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reported-by: David Daney <david.daney@cavium.com>
Tested-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-05 23:23:05 -04:00
Andrea Arcangeli
127393fbe5 mm: thp: kvm: fix memory corruption in KVM with THP enabled
After the THP refcounting change, obtaining a compound pages from
get_user_pages() no longer allows us to assume the entire compound page
is immediately mappable from a secondary MMU.

A secondary MMU doesn't want to call get_user_pages() more than once for
each compound page, in order to know if it can map the whole compound
page.  So a secondary MMU needs to know from a single get_user_pages()
invocation when it can map immediately the entire compound page to avoid
a flood of unnecessary secondary MMU faults and spurious
atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
users).

Ideally instead of the page->_mapcount < 1 check, get_user_pages()
should return the granularity of the "page" mapping in the "mm" passed
to get_user_pages().  However it's non trivial change to pass the "pmd"
status belonging to the "mm" walked by get_user_pages up the stack (up
to the caller of get_user_pages).  So the fix just checks if there is
not a single pte mapping on the page returned by get_user_pages, and in
turn if the caller can assume that the whole compound page is mapped in
the current "mm" (in a pmd_trans_huge()).  In such case the entire
compound page is safe to map into the secondary MMU without additional
get_user_pages() calls on the surrounding tail/head pages.  In addition
of being faster, not having to run other get_user_pages() calls also
reduces the memory footprint of the secondary MMU fault in case the pmd
split happened as result of memory pressure.

Without this fix after a MADV_DONTNEED (like invoked by QEMU during
postcopy live migration or balloning) or after generic swapping (with a
failure in split_huge_page() that would only result in pmd splitting and
not a physical page split), KVM would map the whole compound page into
the shadow pagetables, despite regular faults or userfaults (like
UFFDIO_COPY) may map regular pages into the primary MMU as result of the
pte faults, leading to the guest mode and userland mode going out of
sync and not working on the same memory at all times.

Any other secondary MMU notifier manager (KVM is just one of the many
MMU notifier users) will need the same information if it doesn't want to
run a flood of get_user_pages_fast and it can support multiple
granularity in the secondary MMU mappings, so I think it is justified to
be exposed not just to KVM.

The other option would be to move transparent_hugepage_adjust to
mm/huge_memory.c but that currently has all kind of KVM data structures
in it, so it's definitely not a cut-and-paste work, so I couldn't do a
fix as cleaner as this one for 4.6.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: "Li, Liang Z" <liang.z.li@intel.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Alexandre Bounine
4e1016dac1 rapidio/mport_cdev: fix uapi type definitions
Fix problems in uapi definitions reported by Gabriel Laskar: (see
https://lkml.org/lkml/2016/4/5/205 for details)

 - move public header file rio_mport_cdev.h to include/uapi/linux directory
 - change types in data structures passed as IOCTL parameters
 - improve parameter checking in some IOCTL service routines

Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com>
Reported-by: Gabriel Laskar <gabriel@lse.epita.fr>
Tested-by: Barry Wood <barry.wood@idt.com>
Cc: Gabriel Laskar <gabriel@lse.epita.fr>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Andre van Herk <andre.van.herk@prodrive-technologies.com>
Cc: Barry Wood <barry.wood@idt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Johannes Weiner
4550c4e157 mm: memcontrol: let v2 cgroups follow changes in system swappiness
Cgroup2 currently doesn't have a per-cgroup swappiness setting.  We
might want to add one later - that's a different discussion - but until
we do, the cgroups should always follow the system setting.  Otherwise
it will be unchangeably set to whatever the ancestor inherited from the
system setting at the time of cgroup creation.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.5]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Eric Dumazet
46cc6e4976 tcp: fix lockdep splat in tcp_snd_una_update()
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.

This triggers a lockdep splat on 32bit & SMP builds.

We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.

I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.

Fixes: c10d9310ed ("tcp: do not assume TCP code is non preemptible")
Reported-by: Fabio Estevam <festevam@gmail.com>
Diagnosed-by: Francois Romieu <romieu@fr.zoreil.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Fabio Estevam <fabio.estevam@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 16:55:11 -04:00
Florian Westphal
9b36627ace net: remove dev->trans_start
previous patches removed all direct accesses to dev->trans_start,
so change the netif_trans_update helper to update trans_start of
netdev queue 0 instead and then remove trans_start from struct net_device.

AFAICS a lot of the netif_trans_update() invocations are now useless
because they occur in ndo_start_xmit and driver doesn't set LLTX
(i.e. stack already took care of the update).

As I can't test any of them it seems better to just leave them alone.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:16:50 -04:00
Florian Westphal
ba162f8eed netdevice: add helper to update trans_start
trans_start exists twice:
- as member of net_device (legacy)
- as member of netdev_queue

In order to get rid of the legacy case, add a helper for the
dev->trans_update (this patch), then convert spots that do

dev->trans_start = jiffies

to use this helper (next patch).

This would then allow us to change the helper so that it updates the
trans_stamp of netdev queue 0 instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:16:48 -04:00
Mohamad Haj Yahia
efdc810ba3 net/mlx5: Flow steering, Add vport ACL support
Update the relevant flow steering device structs and commands to
support vport.
Update the flow steering core API to receive vport number.
Add ingress and egress ACL flow table name spaces.
Add ACL flow table support:
* ACL (Access Control List) flow table is a table that contains
only allow/drop steering rules.

* We have two types of ACL flow tables - ingress and egress.

* ACLs handle traffic sent from/to E-Switch FDB table, Ingress refers to
traffic sent from Vport to E-Switch and Egress refers to traffic sent
from E-Switch to vport.

* Ingress ACL flow table allow/drop rules is checked against traffic
sent from VF.

* Egress ACL flow table allow/drop rules is checked against traffic sent
to VF.

Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 14:04:46 -04:00
David S. Miller
cba6532100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/ip_gre.c

Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 00:52:29 -04:00
Linus Torvalds
7391daf2ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Some straggler bug fixes:

   1) Batman-adv DAT must consider VLAN IDs when choosing candidate
      nodes, from Antonio Quartulli.

   2) Fix botched reference counting of vlan objects and neigh nodes in
      batman-adv, from Sven Eckelmann.

   3) netem can crash when it sees GSO packets, the fix is to segment
      then upon ->enqueue.  Fix from Neil Horman with help from Eric
      Dumazet.

   4) Fix VXLAN dependencies in mlx5 driver Kconfig, from Matthew
      Finlay.

   5) Handle VXLAN ops outside of rcu lock, via a workqueue, in mlx5,
      since it can sleep.  Fix also from Matthew Finlay.

   6) Check mdiobus_scan() return values properly in pxa168_eth and macb
      drivers.  From Sergei Shtylyov.

   7) If the netdevice doesn't support checksumming, disable
      segmentation.  From Alexandery Duyck.

   8) Fix races between RDS tcp accept and sending, from Sowmini
      Varadhan.

   9) In macb driver, probe MDIO bus before we register the netdev,
      otherwise we can try to open the device before it is really ready
      for that.  Fix from Florian Fainelli.

  10) Netlink attribute size for ILA "tunnels" not calculated properly,
      fix from Nicolas Dichtel"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  ipv6/ila: fix nlsize calculation for lwtunnel
  net: macb: Probe MDIO bus before registering netdev
  RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock.
  RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock
  vxlan: Add checksum check to the features check function
  net: Disable segmentation if checksumming is not supported
  net: mvneta: Remove superfluous SMP function call
  macb: fix mdiobus_scan() error check
  pxa168_eth: fix mdiobus_scan() error check
  net/mlx5e: Use workqueue for vxlan ops
  net/mlx5e: Implement a mlx5e workqueue
  net/mlx5: Kconfig: Fix MLX5_EN/VXLAN build issue
  net/mlx5: Unmap only the relevant IO memory mapping
  netem: Segment GSO packets on enqueue
  batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node
  batman-adv: Fix reference counting of vlan object for tt_local_entry
  batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP event
  batman-adv: fix DAT candidate selection (must use vid)
2016-05-03 15:07:50 -07:00
Alexander Duyck
af67eb9e7e vxlan: Add checksum check to the features check function
We need to perform an additional check on the inner headers to determine if
we can offload the checksum for them.  Previously this check didn't occur
so we would generate an invalid frame in the case of an IPv6 header
encapsulated inside of an IPv4 tunnel.  To fix this I added a secondary
check to vxlan_features_check so that we can verify that we can offload the
inner checksum.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:00:54 -04:00
Eric Dumazet
9580bf2edb net: relax expensive skb_unclone() in iptunnel_handle_offloads()
Locally generated TCP GSO packets having to go through a GRE/SIT/IPIP
tunnel have to go through an expensive skb_unclone()

Reallocating skb->head is a lot of work.

Test should really check if a 'real clone' of the packet was done.

TCP does not care if the original gso_type is changed while the packet
travels in the stack.

This adds skb_header_unclone() which is a variant of skb_clone()
using skb_header_cloned() check instead of skb_cloned().

This variant can probably be used from other points.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 00:22:19 -04:00