Commit graph

64712 commits

Author SHA1 Message Date
Christoph Hellwig
5b88a17cfd block: optimize bvec iteration in bvec_iter_advance
There is no need to only iterate in chunks of PAGE_SIZE or less in
bvec_iter_advance, given that the callers pass in the chunk length that
they are operating on - either that already is less than PAGE_SIZE
because they do classic page-based iteration, or it is larger because
the caller operates on multi-page bvecs.

This should help shaving off a few cycles of the I/O hot path.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28 13:49:22 -07:00
Ming Lei
594b9a89af block: introduce mp_bvec_for_each_page() for iterating over page
mp_bvec_for_each_segment() is a bit big for the iteration, so introduce
a light-weight helper for iterating over pages, then 32bytes stack
space can be saved.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28 05:56:03 -07:00
Ming Lei
4d633062c1 block: introduce bvec_nth_page()
Single-page bvec can often be seen in small BS workloads, so
introduce bvec_nth_page() for avoiding to call nth_page() unnecessarily,
which looks not cheap.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-27 06:18:52 -07:00
Christoph Hellwig
81214bab58 iomap: wire up the iopoll method
Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device.  Also refactor the common direct I/O bio submission code into a
nice little helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modified to use bio_set_polled().

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24 08:20:17 -07:00
Jens Axboe
0bbb280d7b block: add bio_set_polled() helper
For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already
has async polled IO in-flight, but can't wait for them to complete
since polled requests must be active found and reaped.

Utilize the helper in the blockdev DIRECT_IO code.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24 08:20:17 -07:00
Christoph Hellwig
fb7e160019 fs: add an iopoll method to struct file_operations
This new methods is used to explicitly poll for I/O completion for an
iocb.  It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct iocb to store
the polling cookie.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-24 08:20:17 -07:00
Christoph Hellwig
5d8762d568 nvme-rdma: convert to SPDX identifiers
Update license to use SPDX-License-Identifier instead of verbose license
text.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:19 -07:00
Christoph Hellwig
8638b24614 nvme-fc: convert to SPDX identifiers
Update license to use SPDX-License-Identifier instead of verbose license
text.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:17 -07:00
Christoph Hellwig
055d045a7a nvme-tcp.h: fix SPDX header
For .h files we need to use /* */ style comments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:10 -07:00
Christoph Hellwig
fadccd8fc2 nvme_ioctl.h: remove duplicate GPL boilerplate
We already have a ЅPDX header, so no need to duplicate the information.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:05 -07:00
Jens Axboe
6fb845f0e7 Linux 5.0-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
 yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
 A6Ktjw==
 =scY4
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc6' into for-5.1/block

Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.

* tag 'v5.0-rc6': (525 commits)
  Linux 5.0-rc6
  x86/mm: Make set_pmd_at() paravirt aware
  MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
  blk-mq: remove duplicated definition of blk_mq_freeze_queue
  Blk-iolatency: warn on negative inflight IO counter
  blk-iolatency: fix IO hang due to negative inflight counter
  MAINTAINERS: unify reference to xen-devel list
  x86/mm/cpa: Fix set_mce_nospec()
  futex: Handle early deadlock return correctly
  futex: Fix barrier comment
  net: dsa: b53: Fix for failure when irq is not defined in dt
  blktrace: Show requests without sector
  mips: cm: reprime error cause
  mips: loongson64: remove unreachable(), fix loongson_poweroff().
  sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
  geneve: should not call rt6_lookup() when ipv6 was disabled
  KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
  KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
  kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
  signal: Better detection of synchronous signals
  ...
2019-02-15 08:43:59 -07:00
Ming Lei
56d18f62f5 block: kill BLK_MQ_F_SG_MERGE
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:12 -07:00
Ming Lei
2705c93742 block: kill QUEUE_FLAG_NO_SG_MERGE
Since bdced438ac ("block: setup bi_phys_segments after splitting"),
physical segment number is mainly figured out in blk_queue_split() for
fast path, and the flag of BIO_SEG_VALID is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.

Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
is set in blk_queue_split().

For another user of blk_recalc_rq_segments():

- run in partial completion branch of blk_update_request, which is an unusual case

- run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed
since dm-rq is the only user.

Multi-page bvec is enabled now, not doing S/G merging is rather pointless with the
current setup of the I/O path, as it isn't going to save you a significant amount
of cycles.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:12 -07:00
Ming Lei
6861428921 block: always define BIO_MAX_PAGES as 256
Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.

CONFIG_THP_SWAP needs to split one THP into normal pages and adds
them all to one bio. With multipage-bvec, it just takes one bvec to
hold them all.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:12 -07:00
Ming Lei
07173c3ec2 block: enable multipage bvecs
This patch pulls the trigger for multi-page bvecs.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:12 -07:00
Ming Lei
6dc4f100c1 block: allow bio_for_each_segment_all() to iterate over multi-page bvec
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
then we can allow bio_for_each_segment_all() to iterate over multi-page bvec.

Given it is just one mechannical & simple change on all bio_for_each_segment_all()
users, this patch does tree-wide change in one single patch, so that we can
avoid to use a temporary helper for this conversion.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Ming Lei
45a3fb9529 block: introduce mp_bvec_last_segment()
BTRFS and guard_bio_eod() need to get the last singlepage segment
from one multipage bvec, so introduce this helper to make them happy.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:11 -07:00
Ming Lei
d18d91740a block: introduce bio_for_each_bvec() and rq_for_each_bvec()
bio_for_each_bvec() is used for iterating over multi-page bvec for bio
split & merge code.

rq_for_each_bvec() can be used for drivers which may handle the
multi-page bvec directly, so far loop is one perfect use case.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:10 -07:00
Ming Lei
3d75ca0ade block: introduce multi-page bvec helpers
This patch introduces helpers of 'mp_bvec_iter_*' for multi-page bvec
support.

The introduced helpers treate one bvec as real multi-page segment,
which may include more than one pages.

The existed helpers of bvec_iter_* are interfaces for supporting current
bvec iterator which is thought as single-page by drivers, fs, dm and
etc. These introduced helpers will build single-page bvec in flight, so
this way won't break current bio/bvec users, which needn't any change.

Follows some multi-page bvec background:

- bvecs stored in bio->bi_io_vec is always multi-page style

- bvec(struct bio_vec) represents one physically contiguous I/O
  buffer, now the buffer may include more than one page after
  multi-page bvec is supported, and all these pages represented
  by one bvec is physically contiguous. Before multi-page bvec
  support, at most one page is included in one bvec, we call it
  single-page bvec.

- .bv_page of the bvec points to the 1st page in the multi-page bvec

- .bv_offset of the bvec is the offset of the buffer in the bvec

The effect on the current drivers/filesystem/dm/bcache/...:

- almost everyone supposes that one bvec only includes one single
  page, so we keep the sp interface not changed, for example,
  bio_for_each_segment() still returns single-page bvec

- bio_for_each_segment_all() will return single-page bvec too

- during iterating, iterator variable(struct bvec_iter) is always
  updated in multi-page bvec style, and bvec_iter_advance() is kept
  not changed

- returned(copied) single-page bvec is built in flight by bvec
  helpers from the stored multi-page bvec

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:10 -07:00
Ming Lei
19d62f6d00 block: remove bvec_iter_rewind()
Commit 7759eb23fd ("block: remove bio_rewind_iter()") removes
bio_rewind_iter(), then no one uses bvec_iter_rewind() any more,
so remove it.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:10 -07:00
Christoph Hellwig
8a2ee44a37 btrfs: look at bi_size for repair decisions
bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O.  But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying.  Use bi_size for that and shift by
PAGE_SHIFT.  This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15 08:40:10 -07:00
Linus Torvalds
73a4c52184 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Ingo Molnar:
 "irqchip driver fixes: most of them are race fixes for ARM GIC (General
  Interrupt Controller) variants, but also a fix for the ARM MMP
  (Marvell PXA168 et al) irqchip affecting OLPC keyboards"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v3-its: Fix ITT_entry_size accessor
  irqchip/mmp: Only touch the PJ4 IRQ & FIQ bits on enable/disable
  irqchip/gic-v3-its: Gracefully fail on LPI exhaustion
  irqchip/gic-v3-its: Plug allocation race for devices sharing a DevID
  irqchip/gic-v4: Fix occasional VLPI drop
2019-02-10 09:54:19 -08:00
Jens Axboe
eca7abf31a block: queue flag cleanup
We have QUEUE_FLAG_DEFAULT defined, but it's not used anymore since
the legacy IO stack is gone. Kill it.

Sanitize the queue flags in general, they use spaces (for some
reason), and the space is pretty sparse. With the flags renumbered,
we can more clearly see how many we have available.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 15:42:07 -07:00
Jens Axboe
d11a399898 block: kill QUEUE_FLAG_FLUSH_NQ
We have various helpers for setting/clearing this flag, and also
a helper to check if the queue supports queueable flushes or not.
But nobody uses them anymore, kill it with fire.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-09 15:40:24 -07:00
Linus Torvalds
e5a8a11632 for-linus-20190209
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxfARoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjsgEACP8vQzbvsOZOxHKi9Vcd8ziwyjyBebNh4F
 cKOx2Blgv0ReVAqLOVp9VJOJQoVQumV1btaA2YrmevxnCMpNUBpbP6G02tAqe9Z+
 D75FSpZXy4UvcMSlhfc/iB/RMI06benI9LnuL7zbzIQtrbtu+OFRnO6fpQOVGLxT
 Qa1wt/Rgahc48L4aHnIgPn0nyBRsEvuhC6FjI2D8akDaNiaHzwtGbpx7yDdmLNml
 fCzC2uSRJ31bXsO/5/fJorinaJ56r5N8aHaINYwXDv8zd8i94nQZhITAasXub1Km
 0nyuAg/fSzIdkrGmPINTKFaGYsOfRwpS4C4vagreBhzjfolPY0z9sQEQ63gZzDrd
 mAjHPxLTd165OLlR/RxoMC8AjZCZ0/YQaucxUOPkaIHfth5/dy5BFaCkWyA/I7/Z
 VnAyq0SqeL4hgIOGxZM0HeehKx+palNdJNZTcY7vF/7MVPuh5WM6z/FWsFa8k+ss
 B9YN4wchh7I8EVbLmfz9s/eqabRWF3Agh1dE+yAKwt1KIWHaMXWZTRQnj/69fs2e
 s3pwVMiiSz6K/Xnoe12nmQ4K0XeyKNROO78IIGY/Oa0Pe/hzCAaJMRMDsLp5EcJj
 dxpoi1OfGHMGoqYhL6tx6Atq5f6CMDrS28k/D44DHfO7T1qQGVy1A9SY7ZCfM5+c
 HKxTuRh8mg==
 =tuL6
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190209' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Christoph, fixing namespace locking when
   dealing with the effects log, and a rapid add/remove issue (Keith)

 - blktrace tweak, ensuring requests with -1 sectors are shown (Jan)

 - link power management quirk for a Smasung SSD (Hans)

 - m68k nfblock dynamic major number fix (Chengguang)

 - series fixing blk-iolatency inflight counter issue (Liu)

 - ensure that we clear ->private when setting up the aio kiocb (Mike)

 - __find_get_block_slow() rate limit print (Tetsuo)

* tag 'for-linus-20190209' of git://git.kernel.dk/linux-block:
  blk-mq: remove duplicated definition of blk_mq_freeze_queue
  Blk-iolatency: warn on negative inflight IO counter
  blk-iolatency: fix IO hang due to negative inflight counter
  blktrace: Show requests without sector
  fs: ratelimit __find_get_block_slow() failure message.
  m68k: set proper major_num when specifying module param major_num
  libata: Add NOLPM quirk for SAMSUNG MZ7TE512HMHP-000L1 SSD
  nvme-pci: fix rapid add remove sequence
  nvme: lock NS list changes while handling command effects
  aio: initialize kiocb private in case any filesystems expect it.
2019-02-09 10:26:09 -08:00
Linus Torvalds
6b2912cedc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull signal fixes from Eric Biederman:
 "This contains four small fixes for signal handling. A missing range
  check, a regression fix, prioritizing signals we have already started
  a signal group exit for, and better detection of synchronous signals.

  The confused decision of which signals to handle failed spectacularly
  when a timer was pointed at SIGBUS and the stack overflowed. Resulting
  in an unkillable process in an infinite loop instead of a SIGSEGV and
  core dump"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  signal: Better detection of synchronous signals
  signal: Always notice exiting tasks
  signal: Always attempt to allocate siginfo for SIGSTOP
  signal: Make siginmask safe when passed a signal of 0
2019-02-08 15:39:28 -08:00
Linus Torvalds
27b4ad621e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "This pull request is dedicated to the upcoming snowpocalypse parts 2
  and 3 in the Pacific Northwest:

   1) Drop profiles are broken because some drivers use dev_kfree_skb*
      instead of dev_consume_skb*, from Yang Wei.

   2) Fix IWLWIFI kconfig deps, from Luca Coelho.

   3) Fix percpu maps updating in bpftool, from Paolo Abeni.

   4) Missing station release in batman-adv, from Felix Fietkau.

   5) Fix some networking compat ioctl bugs, from Johannes Berg.

   6) ucc_geth must reset the BQL queue state when stopping the device,
      from Mathias Thore.

   7) Several XDP bug fixes in virtio_net from Toshiaki Makita.

   8) TSO packets must be sent always on queue 0 in stmmac, from Jose
      Abreu.

   9) Fix socket refcounting bug in RDS, from Eric Dumazet.

  10) Handle sparse cpu allocations in bpf selftests, from Martynas
      Pumputis.

  11) Make sure mgmt frames have enough tailroom in mac80211, from Felix
      Feitkau.

  12) Use safe list walking in sctp_sendmsg() asoc list traversal, from
      Greg Kroah-Hartman.

  13) Make DCCP's ccid_hc_[rt]x_parse_options always check for NULL
      ccid, from Eric Dumazet.

  14) Need to reload WoL password into bcmsysport device after deep
      sleeps, from Florian Fainelli.

  15) Remove filter from mask before freeing in cls_flower, from Petr
      Machata.

  16) Missing release and use after free in error paths of s390 qeth
      code, from Julian Wiedmann.

  17) Fix lockdep false positive in dsa code, from Marc Zyngier.

  18) Fix counting of ATU violations in mv88e6xxx, from Andrew Lunn.

  19) Fix EQ firmware assert in qed driver, from Manish Chopra.

  20) Don't default Caivum PTP to Y in kconfig, from Bjorn Helgaas"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
  net: dsa: b53: Fix for failure when irq is not defined in dt
  sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
  geneve: should not call rt6_lookup() when ipv6 was disabled
  net: Don't default Cavium PTP driver to 'y'
  net: broadcom: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: via-velocity: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: tehuti: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: sun: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: fsl_ucc_hdlc: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: fec_mpc52xx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: smsc: epic100: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: dscc4: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: tulip: de2104x: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: defxx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net/mlx5e: Don't overwrite pedit action when multiple pedit used
  net/mlx5e: Update hw flows when encap source mac changed
  qed*: Advance drivers version to 8.37.0.20
  qed: Change verbosity for coalescing message.
  qede: Fix system crash on configuring channels.
  qed: Consider TX tcs while deriving the max num_queues for PF.
  ...
2019-02-08 11:21:54 -08:00
Thomas Gleixner
8087f40736 irqchip update for 5.0-rc6
- Another GICv3 ITS fix for devices sharing the same DevID
 - Don't return invalid data on exhaustion of the GICv3 LPI pool
 - Fix a GICv3 field decoding bug leading to memory over-allocation
 - Init GICv4 at boot time instead of lazy init
 - Fix interrupt masking on PJ4
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAlxaspYVHG1hcmMuenlu
 Z2llckBhcm0uY29tAAoJECPQ0LrRPXpDMS8QAIUNTdbpgtCc/3Lr4n1qHlzGCR1N
 n4FHYgpR1txcPcVyu92jLxEwnlzSw2zjexuqUNaUYVUKe6wzEAXpDba+dDhvLrW5
 JfT1RE7cDMWu34iKAorGg1sqAb3yl67d6Oi88abbNdv3zSa9EVyVBRp3eTWotNzg
 EAtVorirI41CZNirt3ZU0TRMX0wAc5g7HTq0AwBcQSsg0P6ZCyVqZU5w/JNsPzWH
 VIpYue8XlDNQ8SDJHRppHhyrghVHG0tpSHKo9I8cA9+4CK6CMd8OwZhdNsf2ISiL
 vcEOyuOQfIIyTry59F11eiwlMNa4nmsLlmzezOZUeb3NafxsqMmDsbjbOiDMIkzh
 WUt8BlWOSyy2sWlokX3Qu9CLlWxa+QydGokNdUufLYOAVQG+VDUD5lpaposUSNAe
 0thGXenlBO43NqTdldZu2L6E0Wb0M0CTuILOSg6M+6olTR5hsGJiUaV+3WLJHQi3
 gLzbAFu5JRzpJWqOPh/9pF5dQQIZ8Ls/C2oJlYt+CTiYA3je4pzWzEsmP2CBTYUy
 WZEq/mhcze0kFu2b7FnYNhvYkSPool4t5GajwoBYb34N2rnJa7KVYsBDAV0gUJ1t
 MZbTTzWUeWpsm31uZlvKiPSDHqpUPucZ9HtqBH8OKef6IZz4tMK7v9qx1yZ/C2ZL
 dSNbXHBEI7X44Ea2
 =Crvb
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent

Pull irqchip updates from Marc Zyngier:

- Another GICv3 ITS fix for devices sharing the same DevID
- Don't return invalid data on exhaustion of the GICv3 LPI pool
- Fix a GICv3 field decoding bug leading to memory over-allocation
- Init GICv4 at boot time instead of lazy init
- Fix interrupt masking on PJ4
2019-02-07 21:12:40 +01:00
Jan Kara
0803de7804 blktrace: Show requests without sector
Currently, blktrace will not show requests that don't have any data as
rq->__sector is initialized to -1 which is out of device range and thus
discarded by act_log_check(). This is most notably the case for cache
flush requests sent to the device. Fix the problem by making
blk_rq_trace_sector() return 0 for requests without initialized sector.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-07 12:57:17 -07:00
Linus Torvalds
d47e3da175 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fix from Jiri Kosina:
 "A fix for a bug in hid-debug that can lock up the kernel in infinite
  loop (CVE-2019-3819), from Vladis Dronov"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: debug: fix the ring buffer implementation
2019-02-07 11:51:31 +00:00
Linus Torvalds
24b888d8d5 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "A few updates for x86:

   - Fix an unintended sign extension issue in the fault handling code

   - Rename the new resource control config switch so it's less
     confusing

   - Avoid setting up EFI info in kexec when the EFI runtime is
     disabled.

   - Fix the microcode version check in the AMD microcode loader so it
     only loads higher version numbers and never downgrades

   - Set EFER.LME in the 32bit trampoline before returning to long mode
     to handle older AMD/KVM behaviour properly.

   - Add Darren and Andy as x86/platform reviewers"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/resctrl: Avoid confusion over the new X86_RESCTRL config
  x86/kexec: Don't setup EFI info if EFI runtime is not enabled
  x86/microcode/amd: Don't falsely trick the late loading mechanism
  MAINTAINERS: Add Andy and Darren as arch/x86/platform/ reviewers
  x86/fault: Fix sign-extend unintended sign extension
  x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode
  x86/cpu: Add Atom Tremont (Jacobsville)
2019-02-03 09:08:12 -08:00
Linus Torvalds
cc6810e36b Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu hotplug fixes from Thomas Gleixner:
 "Two fixes for the cpu hotplug machinery:

   - Replace the overly clever 'SMT disabled by BIOS' detection logic as
     it breaks KVM scenarios and prevents speculation control updates
     when the Hyperthreads are brought online late after boot.

   - Remove a redundant invocation of the speculation control update
     function"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
  x86/speculation: Remove redundant arch_smt_update() invocation
2019-02-03 09:02:03 -08:00
Linus Torvalds
c8864cb70f for-linus-20190202
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxVxeIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq35EADCoVRFF8mi7wBhNvfN/yLqC9sgnqJM7vF3
 gPpm/E4MsTzCXrhgRmulrikfM8ywOT5ZBgIQp+BrhQgDZIlJ9fcyinFjU7o/gRm5
 R32IMJ4o7uh+YKWQyRSQu1WCaF3hvbNGUT7duMfTKjZ6t9TxTBgy46wYhb7YmNAP
 Ur8C1+4NfF/aHp59n6COM70KdYfswRxtgdEHGfmHAbmKDpvAeC+7I2LPpjTf5bH/
 YULnxk5sQFvCINDJH4zprZ0lSDy+63qk+Q1xxqpFNR1tnmRIVNvKhPPB5xfRLvzB
 mw3JdtwnvoY8Yv6eCs0u7mZs5L3I8zTI+4RtHH9nPD7ykIiUHpejgaRc4TGy2tox
 Dpgfc/Nvdsscpuy4QcFNHbWWBUu5pa5Li+KVS0FEP9FmD5UhcvVmZaZK+V6NuGO4
 9G9wraASFasPK7I0FMHlWLMIWKdj4s4n/H55QnP6yvFsnsrFaqnDwybUFiicjFkv
 hmNQmq8+5p0n3ZBVGQ/SI//vPUjuaFUKU2MhhW0NVz+KkgmEnOJ+W+C77//U33l+
 zhBURtdlnqfImFejEayhhtCtMATJcf2E1rHlA3nM6JyVWMRvQR6asb78QhAKZd6w
 el7rqRdWMwryUYFaROVurfOBMPFdhhyDB1qrOzAZIkhFqWwM9GX6+MH/y8+00HSA
 aA/rXg/daw==
 =8Zz+
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20190202' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes that should go into this release. This contains:

   - MD pull request from Song, fixing a recovery OOM issue (Alexei)

   - Fix for a sync related stall (Jianchao)

   - Dummy callback for timeouts (Tetsuo)

   - IDE atapi sense ordering fix (me)"

* tag 'for-linus-20190202' of git://git.kernel.dk/linux-block:
  ide: ensure atapi sense request aren't preempted
  blk-mq: fix a hung issue when fsync
  block: pass no-op callback to INIT_WORK().
  md/raid5: fix 'out of memory' during raid cache recovery
2019-02-02 10:16:28 -08:00
Johannes Weiner
e6d429313e x86/resctrl: Avoid confusion over the new X86_RESCTRL config
"Resource Control" is a very broad term for this CPU feature, and a term
that is also associated with containers, cgroups etc. This can easily
cause confusion.

Make the user prompt more specific. Match the config symbol name.

 [ bp: In the future, the corresponding ARM arch-specific code will be
   under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
   under the CPU_RESCTRL umbrella symbol. ]

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Babu Moger <Babu.Moger@amd.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morse <james.morse@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: linux-doc@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pu Wen <puwen@hygon.cn>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
2019-02-02 10:34:52 +01:00
Qian Cai
b13bc35193 mm/hotplug: invalid PFNs from pfn_to_online_page()
On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
with CONFIG_DEBUG_VM_PGFLAGS=y due to page_to_nid() found a pfn that is
not directly mapped (MEMBLOCK_NOMAP).  Hence, the page->flags is
uninitialized.

This is due to the commit 9f1eb38e0e ("mm, kmemleak: little
optimization while scanning") starts to use pfn_to_online_page() instead
of pfn_valid().  However, in the CONFIG_MEMORY_HOTPLUG=y case,
pfn_to_online_page() does not call memblock_is_map_memory() while
pfn_valid() does.

Historically, the commit 68709f4538 ("arm64: only consider memblocks
with NOMAP cleared for linear mapping") causes pages marked as nomap
being no long reassigned to the new zone in memmap_init_zone() by
calling __init_single_page().

Since the commit 2d070eab2e ("mm: consider zone which is not fully
populated to have holes") introduced pfn_to_online_page() and was
designed to return a valid pfn only, but it is clearly broken on arm64.

Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can
handle nomap thanks to the commit f52bb98f5a ("arm64: mm: always
enable CONFIG_HOLES_IN_ZONE"), while it will be optimized away on
architectures where have no HOLES_IN_ZONE.

[1]
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
  Mem abort info:
    ESR = 0x96000005
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
  Data abort info:
    ISV = 0, ISS = 0x00000005
    CM = 0, WnR = 0
  Internal error: Oops: 96000005 [#1] SMP
  CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8
  pstate: 60400009 (nZCv daif +PAN -UAO)
  pc : page_mapping+0x24/0x144
  lr : __dump_page+0x34/0x3dc
  sp : ffff00003a5cfd10
  x29: ffff00003a5cfd10 x28: 000000000000802f
  x27: 0000000000000000 x26: 0000000000277d00
  x25: ffff000010791f56 x24: ffff7fe000000000
  x23: ffff000010772f8b x22: ffff00001125f670
  x21: ffff000011311000 x20: ffff000010772f8b
  x19: fffffffffffffffe x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000
  x15: 0000000000000000 x14: ffff802698b19600
  x13: ffff802698b1a200 x12: ffff802698b16f00
  x11: ffff802698b1a400 x10: 0000000000001400
  x9 : 0000000000000001 x8 : ffff00001121a000
  x7 : 0000000000000000 x6 : ffff0000102c53b8
  x5 : 0000000000000000 x4 : 0000000000000003
  x3 : 0000000000000100 x2 : 0000000000000000
  x1 : ffff000010772f8b x0 : ffffffffffffffff
  Process kmemleak (pid: 1408, stack limit = 0x(____ptrval____))
  Call trace:
   page_mapping+0x24/0x144
   __dump_page+0x34/0x3dc
   dump_page+0x28/0x4c
   kmemleak_scan+0x4ac/0x680
   kmemleak_scan_thread+0xb4/0xdc
   kthread+0x12c/0x13c
   ret_from_fork+0x10/0x18
  Code: d503201f f9400660 36000040 d1000413 (f9400661)
  ---[ end trace 4d4bd7f573490c8e ]---
  Kernel panic - not syncing: Fatal exception
  SMP: stopping secondary CPUs
  Kernel Offset: disabled
  CPU features: 0x002,20000c38
  Memory Limit: none
  ---[ end Kernel panic - not syncing: Fatal exception ]---

Link: http://lkml.kernel.org/r/20190122132916.28360-1-cai@lca.pw
Fixes: 9f1eb38e0e ("mm, kmemleak: little optimization while scanning")
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
Tetsuo Handa
9bcdeb51bd oom, oom_reaper: do not enqueue same task twice
Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite the
number of tasks in that memcg is not 0.  It turned out that there is a
bug in wake_oom_reaper() which allows enqueuing same task twice which
makes impossible to decrease the number of tasks in that memcg due to a
refcount leak.

This bug existed since the OOM reaper became invokable from
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,

  T1@P1     |T2@P1     |T3@P1     |OOM reaper
  ----------+----------+----------+------------
                                   # Processing an OOM victim in a different memcg domain.
                        try_charge()
                          mem_cgroup_out_of_memory()
                            mutex_lock(&oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(&oom_lock)
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(&oom_lock)
                            out_of_memory()
                              oom_kill_process(P1)
                                do_send_sig_info(SIGKILL, @P1)
                                mark_oom_victim(T1@P1)
                                wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                            mutex_unlock(&oom_lock)
                 out_of_memory()
                   mark_oom_victim(T2@P1)
                   wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                 mutex_unlock(&oom_lock)
      out_of_memory()
        mark_oom_victim(T1@P1)
        wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
      mutex_unlock(&oom_lock)
                                   # Completed processing an OOM victim in a different memcg domain.
                                   spin_lock(&oom_reaper_lock)
                                   # T1P1 is dequeued.
                                   spin_unlock(&oom_reaper_lock)

but memcg's group oom killing made it easier to trigger this bug by
calling wake_oom_reaper() on the same task from one out_of_memory()
request.

Fix this bug using an approach used by commit 855b018325 ("oom,
oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
side effect of this patch, this patch also avoids enqueuing multiple
threads sharing memory via task_will_free_mem(current) path.

Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
Fixes: af8e15cc85 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: Jay Kamat <jgkamat@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 15:46:23 -08:00
David S. Miller
e7b816415e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-01-31

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) disable preemption in sender side of socket filters, from Alexei.

2) fix two potential deadlocks in syscall bpf lookup and prog_register,
   from Martin and Alexei.

3) fix BTF to allow typedef on func_proto, from Yonghong.

4) two bpftool fixes, from Jiri and Paolo.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:28:07 -08:00
Linus Torvalds
3325254ca1 Power management fixes for 5.0-rc5
- Since ktime_get() turns out to be problematic for device
    autosuspend in the PM-runtime framework, make it use
    ktime_get_mono_fast_ns() instead (Vincent Guittot).
 
  - Fix an initial value of a local variable in the "poll idle state"
    code that makes it behave not exactly as expected when all idle
    states except for the "polling" one are disabled (Doug Smythies).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJcVB7NAAoJEILEb/54YlRxJNsP+gM+tCQ4oUTj5Q1sZnYDF3Ti
 VZCBCaN4JfRdNj+grYNAi/MHRNG2jox+Epn7EFHahtRAtbck9TEmnqfP8uJrV5fY
 mJ0mNgkt09O2PI0U+8IhSjC71Nba7zq5IPkhMr9+Gha3f3UsTGD0HCJz3fHm/50W
 coilIUTH5usdnbGZjiLhbVb9d9koJjBCWOgYZsQp+flAoj4aasw4xVuu7AoQ6ZOy
 hqjhyUPSPlE35ZVi3NaXUYeDmPgx6IJ+Bs/SQvMEcH9fyCKfEIL82S0yBZxys7j2
 ZqE2qBsEPl8elZFFNS7hwwezL0ny2BpzzrODlSrVafSr0kuS6r+sRDgkpn8whwZw
 cMmf/E7HXrgwpdnvjMywJ3vOVNsXu1QwDI21c8E3lBG08wlk1yWzpMB6BK+zusmC
 TvOjfPEEosA6RUC9rOeFU6h3OhWr3AJjCfBzkK0gkTLC3jWlPWwqor0vWR4EaSo5
 5atl8hfZPEVBkPvIix6i1luP1Kvew2Ud7LR60MTAfVJAag6/AfxYAF4IuRCdj4uO
 BLdFdN6PBsT416BJ0w6qBWxsHX0qiNMH9O1B8jNWyrAIMjgKz2cRexdwBWmxVBnn
 lijnyiH7owGi3qbjnOQ86mU7XvnApnyf/+E8tFuj4VveLqxvItLvb7rT/Zn6xDTn
 ZbQXI+5nYrmLbI6AB58u
 =3UG4
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix a PM-runtime framework regression introduced by the recent
  switch-over of device autosuspend to hrtimers and a mistake in the
  "poll idle state" code introduced by a recent change in it.

  Specifics:

   - Since ktime_get() turns out to be problematic for device
     autosuspend in the PM-runtime framework, make it use
     ktime_get_mono_fast_ns() instead (Vincent Guittot).

   - Fix an initial value of a local variable in the "poll idle state"
     code that makes it behave not exactly as expected when all idle
     states except for the "polling" one are disabled (Doug Smythies)"

* tag 'pm-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpuidle: poll_state: Fix default time limit
  PM-runtime: Fix deadlock with ktime_get()
2019-02-01 10:23:39 -08:00
Alexei Starovoitov
6cab5e90ab bpf: run bpf programs with preemption disabled
Disabled preemption is necessary for proper access to per-cpu maps
from BPF programs.

But the sender side of socket filters didn't have preemption disabled:
unix_dgram_sendmsg->sk_filter->sk_filter_trim_cap->bpf_prog_run_save_cb->BPF_PROG_RUN

and a combination of af_packet with tun device didn't disable either:
tpacket_snd->packet_direct_xmit->packet_pick_tx_queue->ndo_select_queue->
  tun_select_queue->tun_ebpf_select_queue->bpf_prog_run_clear_cb->BPF_PROG_RUN

Disable preemption before executing BPF programs (both classic and extended).

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-31 23:14:55 +01:00
Jens Axboe
9a6d548800 ide: ensure atapi sense request aren't preempted
There's an issue with how sense requests are handled in IDE. If ide-cd
encounters an error, it queues a sense request. With how IDE request
handling is done, this is the next request we need to handle. But it's
impossible to guarantee this, as another request could come in between
the sense being queued, and ->queue_rq() being run and handling it. If
that request ALSO fails, then we attempt to doubly queue the single
sense request we have.

Since we only support one active request at the time, defer request
processing when a sense request is queued.

Fixes: 600335205b "ide: convert to blk-mq"
Reported-by: He Zhe <zhe.he@windriver.com>
Tested-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-31 08:25:09 -07:00
Zenghui Yu
56841070cc irqchip/gic-v3-its: Fix ITT_entry_size accessor
According to ARM IHI 0069C (ID070116), we should use GITS_TYPER's
bits [7:4] as ITT_entry_size instead of [8:4]. Although this is
pretty annoying, it only results in a potential over-allocation
of memory, and nothing bad happens.

Fixes: 3dfa576bfb ("irqchip/gic-v3-its: Add probing for VLPI properties")
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
[maz: massaged subject and commit message]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2019-01-31 12:51:33 +00:00
Jose Abreu
4ec5302fa9 net: stmmac: Fallback to Platform Data clock in Watchdog conversion
If we don't have DT then stmmac_clk will not be available. Let's add a
new Platform Data field so that we can specify the refclk by this mean.

This way we can still use the coalesce command in PCI based setups.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30 22:24:49 -08:00
Daniel Borkmann
d5256083f6 ipvlan, l3mdev: fix broken l3s mode wrt local routes
While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
I ran into the issue that while l3 mode is working fine, l3s mode
does not have any connectivity to kube-apiserver and hence all pods
end up in Error state as well. The ipvlan master device sits on
top of a bond device and hostns traffic to kube-apiserver (also running
in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
where the latter is the address of the bond0. While in l3 mode, a
curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
works fine from hostns, neither of them do in case of l3s. In the
latter only a curl to https://127.0.0.1:37573 appeared to work where
for local addresses of bond0 I saw kernel suddenly starting to emit
ARP requests to query HW address of bond0 which remained unanswered
and neighbor entries in INCOMPLETE state. These ARP requests only
happen while in l3s.

Debugging this further, I found the issue is that l3s mode is piggy-
backing on l3 master device, and in this case local routes are using
l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
f5a0aab84b ("net: ipv4: dst for local input routes should use l3mdev
if relevant") and 5f02ce24c2 ("net: l3mdev: Allow the l3mdev to be
a loopback"). I found that reverting them back into using the
net->loopback_dev fixed ipvlan l3s connectivity and got everything
working for the CNI.

Now judging from 4fbae7d83c ("ipvlan: Introduce l3s mode") and the
l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
on l3 master device is to get the l3mdev_ip_rcv() receive hook for
setting the dst entry of the input route without adding its own
ipvlan specific hacks into the receive path, however, any l3 domain
semantics beyond just that are breaking l3s operation. Note that
ipvlan also has the ability to dynamically switch its internal
operation from l3 to l3s for all ports via ipvlan_set_port_mode()
at runtime. In any case, l3 vs l3s soley distinguishes itself by
'de-confusing' netfilter through switching skb->dev to ipvlan slave
device late in NF_INET_LOCAL_IN before handing the skb to L4.

Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
without any additional l3mdev semantics on top. This should also have
minimal impact since dev->priv_flags is already hot in cache. With
this set, l3s mode is working fine and I also get things like
masquerading pod traffic on the ipvlan master properly working.

  [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf

Fixes: f5a0aab84b ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Fixes: 5f02ce24c2 ("net: l3mdev: Allow the l3mdev to be a loopback")
Fixes: 4fbae7d83c ("ipvlan: Introduce l3s mode")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Martynas Pumputis <m@lambda.lt>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30 22:13:34 -08:00
Vincent Guittot
15efb47dc5 PM-runtime: Fix deadlock with ktime_get()
A deadlock has been seen when swicthing clocksources which use
PM-runtime.  The call path is:

change_clocksource
    ...
    write_seqcount_begin
    ...
    timekeeping_update
        ...
        sh_cmt_clocksource_enable
            ...
            rpm_resume
                pm_runtime_mark_last_busy
                    ktime_get
                        do
                            read_seqcount_begin
                        while read_seqcount_retry
    ....
    write_seqcount_end

Although we should be safe because we haven't yet changed the
clocksource at that time, we can't do that because of seqcount
protection.

Use ktime_get_mono_fast_ns() instead which is lock safe for such
cases.

With ktime_get_mono_fast_ns, the timestamp is not guaranteed to be
monotonic across an update and as a result can goes backward.
According to update_fast_timekeeper() description: "In the worst
case, this can result is a slightly wrong timestamp (a few
nanoseconds)". For PM-runtime autosuspend, this means only that
the suspend decision may be slightly suboptimal.

Fixes: 8234f6734c ("PM-runtime: Switch autosuspend over to using hrtimers")
Reported-by: Biju Das <biju.das@bp.renesas.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-01-30 22:49:06 +01:00
Waiman Long
af0c9af1b3 fs/dcache: Track & report number of negative dentries
The current dentry number tracking code doesn't distinguish between
positive & negative dentries.  It just reports the total number of
dentries in the LRU lists.

As excessive number of negative dentries can have an impact on system
performance, it will be wise to track the number of positive and
negative dentries separately.

This patch adds tracking for the total number of negative dentries in
the system LRU lists and reports it in the 5th field in the
/proc/sys/fs/dentry-state file.  The number, however, does not include
negative dentries that are in flight but not in the LRU yet as well as
those in the shrinker lists which are on the way out anyway.

The number of positive dentries in the LRU lists can be roughly found by
subtracting the number of negative dentries from the unused count.

Matthew Wilcox had confirmed that since the introduction of the
dentry_stat structure in 2.1.60, the dummy array was there, probably for
future extension.  They were not replacements of pre-existing fields.
So no sane applications that read the value of /proc/sys/fs/dentry-state
will do dummy thing if the last 2 fields of the sysctl parameter are not
zero.  IOW, it will be safe to use one of the dummy array entry for
negative dentry count.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-30 11:02:11 -08:00
Waiman Long
7d10f70fc1 fs: Don't need to put list_lru into its own cacheline
The list_lru structure is essentially just a pointer to a table of
per-node LRU lists.  Even if CONFIG_MEMCG_KMEM is defined, the list
field is just used for LRU list registration and shrinker_id is set at
initialization.  Those fields won't need to be touched that often.

So there is no point to make the list_lru structures to sit in their own
cachelines.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-30 11:02:11 -08:00
Josh Poimboeuf
b284909aba cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
With the following commit:

  73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")

... the hotplug code attempted to detect when SMT was disabled by BIOS,
in which case it reported SMT as permanently disabled.  However, that
code broke a virt hotplug scenario, where the guest is booted with only
primary CPU threads, and a sibling is brought online later.

The problem is that there doesn't seem to be a way to reliably
distinguish between the HW "SMT disabled by BIOS" case and the virt
"sibling not yet brought online" case.  So the above-mentioned commit
was a bit misguided, as it permanently disabled SMT for both cases,
preventing future virt sibling hotplugs.

Going back and reviewing the original problems which were attempted to
be solved by that commit, when SMT was disabled in BIOS:

  1) /sys/devices/system/cpu/smt/control showed "on" instead of
     "notsupported"; and

  2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.

I'd propose that we instead consider #1 above to not actually be a
problem.  Because, at least in the virt case, it's possible that SMT
wasn't disabled by BIOS and a sibling thread could be brought online
later.  So it makes sense to just always default the smt control to "on"
to allow for that possibility (assuming cpuid indicates that the CPU
supports SMT).

The real problem is #2, which has a simple fix: change vmx_vm_init() to
query the actual current SMT state -- i.e., whether any siblings are
currently online -- instead of looking at the SMT "control" sysfs value.

So fix it by:

  a) reverting the original "fix" and its followup fix:

     73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
     bc2d8d262c ("cpu/hotplug: Fix SMT supported evaluation")

     and

  b) changing vmx_vm_init() to query the actual current SMT state --
     instead of the sysfs control value -- to determine whether the L1TF
     warning is needed.  This also requires the 'sched_smt_present'
     variable to exported, instead of 'cpu_smt_control'.

Fixes: 73d5e2b472 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joe Mario <jmario@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
2019-01-30 19:27:00 +01:00
Vladis Dronov
13054abbaa HID: debug: fix the ring buffer implementation
Ring buffer implementation in hid_debug_event() and hid_debug_events_read()
is strange allowing lost or corrupted data. After commit 717adfdaf1
("HID: debug: check length before copy_to_user()") it is possible to enter
an infinite loop in hid_debug_events_read() by providing 0 as count, this
locks up a system. Fix this by rewriting the ring buffer implementation
with kfifo and simplify the code.

This fixes CVE-2019-3819.

v2: fix an execution logic and add a comment
v3: use __set_current_state() instead of set_current_state()

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1669187
Cc: stable@vger.kernel.org # v4.18+
Fixes: cd667ce247 ("HID: use debugfs for events/reports dumping")
Fixes: 717adfdaf1 ("HID: debug: check length before copy_to_user()")
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
2019-01-29 12:09:11 +01:00
Linus Torvalds
9881051828 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "A small series of fixes which all address possible missed wakeups:

   - Document and fix the wakeup ordering of wake_q

   - Add the missing barrier in rcuwait_wake_up(), which was documented
     in the comment but missing in the code

   - Fix the possible missed wakeups in the rwsem and futex code"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Fix (possible) missed wakeup
  futex: Fix (possible) missed wakeup
  sched/wake_q: Fix wakeup ordering for wake_q
  sched/wake_q: Document wake_q_add()
  sched/wait: Fix rcuwait_wake_up() ordering
2019-01-27 11:52:50 -08:00
Linus Torvalds
0d484375d7 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 "A small set of fixes for the interrupt subsystem:

   - Fix a double increment in the irq descriptor allocator which
     resulted in a sanity check only being done for every second
     affinity mask

   - Add a missing device tree translation in the stm32-exti driver.
     Without that the interrupt association is completely wrong.

   - Initialize the mutex in the GIC-V3 MBI driver

   - Fix the alignment for aliasing devices in the GIC-V3-ITS driver so
     multi MSI allocations work correctly

   - Ensure that the initial affinity of a interrupt is not empty at
     startup time.

   - Drop bogus include in the madera irq chip driver

   - Fix KernelDoc regression"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v3-its: Align PCI Multi-MSI allocation on their size
  genirq/irqdesc: Fix double increment in alloc_descs()
  genirq: Fix the kerneldoc comment for struct irq_affinity_desc
  irqchip/madera: Drop GPIO includes
  irqchip/gic-v3-mbi: Fix uninitialized mbi_lock
  irqchip/stm32-exti: Add domain translate function
  genirq: Make sure the initial affinity is not empty
2019-01-27 11:25:38 -08:00