android_kernel_msm-6.1_noth.../fs
Filipe Manana 062c19e9dd Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON
There's a race between releasing extent buffers that are flagged as stale
and recycling them that makes us it the following BUG_ON at
btrfs_release_extent_buffer_page:

    BUG_ON(extent_buffer_under_io(eb))

The BUG_ON is triggered because the extent buffer has the flag
EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made
dirty by another concurrent task.

Here follows a sequence of steps that leads to the BUG_ON.

      CPU 0                                                    CPU 1                                                CPU 2

path->nodes[0] == eb X
X->refs == 2 (1 for the tree, 1 for the path)
btrfs_header_generation(X) == current trans id
flag EXTENT_BUFFER_DIRTY set on X

btrfs_release_path(path)
    unlocks X

                                                      reads eb X
                                                         X->refs incremented to 3
                                                      locks eb X
                                                      btrfs_del_items(X)
                                                         X becomes empty
                                                         clean_tree_block(X)
                                                             clear EXTENT_BUFFER_DIRTY from X
                                                         btrfs_del_leaf(X)
                                                             unlocks X
                                                             extent_buffer_get(X)
                                                                X->refs incremented to 4
                                                             btrfs_free_tree_block(X)
                                                                X's range is not pinned
                                                                X's range added to free
                                                                  space cache
                                                             free_extent_buffer_stale(X)
                                                                lock X->refs_lock
                                                                set EXTENT_BUFFER_STALE on X
                                                                release_extent_buffer(X)
                                                                    X->refs decremented to 3
                                                                    unlocks X->refs_lock
                                                      btrfs_release_path()
                                                         unlocks X
                                                         free_extent_buffer(X)
                                                             X->refs becomes 2

                                                                                                      __btrfs_cow_block(Y)
                                                                                                          btrfs_alloc_tree_block()
                                                                                                              btrfs_reserve_extent()
                                                                                                                  find_free_extent()
                                                                                                                      gets offset == X->start
                                                                                                              btrfs_init_new_buffer(X->start)
                                                                                                                  btrfs_find_create_tree_block(X->start)
                                                                                                                      alloc_extent_buffer(X->start)
                                                                                                                          find_extent_buffer(X->start)
                                                                                                                              finds eb X in radix tree

    free_extent_buffer(X)
        lock X->refs_lock
            test X->refs == 2
            test bit EXTENT_BUFFER_STALE is set
            test !extent_buffer_under_io(eb)

                                                                                                                              increments X->refs to 3
                                                                                                                              mark_extent_buffer_accessed(X)
                                                                                                                                  check_buffer_tree_ref(X)
                                                                                                                                    --> does nothing,
                                                                                                                                        X->refs >= 2 and
                                                                                                                                        EXTENT_BUFFER_TREE_REF
                                                                                                                                        is set in X
                                                                                                              clear EXTENT_BUFFER_STALE from X
                                                                                                              locks X
                                                                                                          btrfs_mark_buffer_dirty()
                                                                                                              set_extent_buffer_dirty(X)
                                                                                                                  check_buffer_tree_ref(X)
                                                                                                                     --> does nothing, X->refs >= 2 and
                                                                                                                         EXTENT_BUFFER_TREE_REF is set
                                                                                                                  sets EXTENT_BUFFER_DIRTY on X

            test and clear EXTENT_BUFFER_TREE_REF
            decrements X->refs to 2
        release_extent_buffer(X)
            decrements X->refs to 1
            unlock X->refs_lock

                                                                                                      unlock X
                                                                                                      free_extent_buffer(X)
                                                                                                          lock X->refs_lock
                                                                                                          release_extent_buffer(X)
                                                                                                              decrements X->refs to 0
                                                                                                              btrfs_release_extent_buffer_page(X)
                                                                                                                   BUG_ON(extent_buffer_under_io(X))
                                                                                                                       --> EXTENT_BUFFER_DIRTY set on X

Fix this by making find_extent buffer wait for any ongoing task currently
executing free_extent_buffer()/free_extent_buffer_stale() if the extent
buffer has the stale flag set.
A more clean alternative would be to always increment the extent buffer's
reference count while holding its refs_lock spinlock but find_extent_buffer
is a performance critical area and that would cause lock contention whenever
multiple tasks search for the same extent buffer concurrently.

A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream
btrfs patches backported from newer kernels) was hitting this often:

[1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507!
(...)
[1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1
[1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006
[1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000
[1212302.540792] RIP: 0010:[<ffffffffa0507081>]  [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs]
(...)
[1212302.630008] Call Trace:
[1212302.630008]  [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs]
[1212302.630008]  [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs]
[1212302.630008]  [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs]
[1212302.630008]  [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs]
[1212302.630008]  [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs]
[1212302.630008]  [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs]
[1212302.630008]  [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs]
[1212302.630008]  [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs]
[1212302.630008]  [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs]
[1212302.630008]  [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs]
[1212302.630008]  [<ffffffff811b9e28>] evict+0xa8/0x190
[1212302.630008]  [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0

I was also able to reproduce this on a 3.19 kernel, corresponding to Chris'
integration branch from about a month ago, running the following stress
test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram):

  while true; do
     mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd
     mount /dev/sdd /mnt
     snapshot_cmd="btrfs subvolume snapshot -r /mnt"
     snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`"
     fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100
     umount /mnt
  done

Which usually triggers the BUG_ON within less than 24 hours:

[49558.618097] ------------[ cut here ]------------
[49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551!
(...)
[49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G        W      3.19.0-btrfs-next-7+ #3
[49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000
[49558.620031] RIP: 0010:[<ffffffffa0476b1a>]  [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs]
(...)
[49558.620031] Call Trace:
[49558.620031]  [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs]
[49558.620031]  [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43
[49558.620031]  [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs]
[49558.620031]  [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs]
[49558.620031]  [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs]
[49558.620031]  [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs]
[49558.620031]  [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs]
[49558.728054]  [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18
[49558.728054]  [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs]
[49558.728054]  [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs]
[49558.728054]  [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs]
[49558.728054]  [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs]
[49558.728054]  [<ffffffff81155923>] ? iterate_supers+0x60/0xc4
[49558.728054]  [<ffffffff8117966a>] ? do_sync_work+0x91/0x91
[49558.728054]  [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22
[49558.728054]  [<ffffffff81155939>] iterate_supers+0x76/0xc4
[49558.728054]  [<ffffffff811798e8>] sys_sync+0x55/0x83
[49558.728054]  [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-11 07:59:11 -07:00
..
9p VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
adfs
affs fs/affs/super.c: fix switch indentation 2015-02-17 14:34:53 -08:00
afs Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block 2015-02-12 13:50:21 -08:00
autofs4 autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation 2015-02-22 11:43:34 -05:00
befs fs/befs/linuxvfs.c: remove unnecessary casting 2015-02-17 14:34:50 -08:00
bfs
btrfs Btrfs: fix race when reusing stale extent buffers that leads to BUG_ON 2015-05-11 07:59:11 -07:00
cachefiles Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions 2015-02-22 11:38:41 -05:00
ceph Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-02-22 17:42:14 -08:00
cifs Revert "locks: keep a count of locks on the flctx lists" 2015-02-16 14:32:03 -05:00
coda VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
configfs configfs: Fix potential NULL d_inode dereference 2015-02-20 04:56:43 -05:00
cramfs
debugfs debugfs: leave freeing a symlink body until inode eviction 2015-02-22 11:38:43 -05:00
devpts
dlm netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
ecryptfs eCryptfs: don't pass fs-specific ioctl commands through 2015-03-03 02:03:56 -06:00
efivarfs * Move efivarfs from the misc filesystem section to pseudo filesystem, 2015-01-29 19:16:40 +01:00
efs
exofs vfs: remove get_xip_mem 2015-02-16 17:56:03 -08:00
exportfs VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
ext2 ext2: get rid of most mentions of XIP in ext2 2015-02-16 17:56:04 -08:00
ext3 ext3: destroy sbi mutexes in put_super 2015-01-05 11:13:55 +01:00
ext4 Ext4 bug fixes for 3.20. We also reserved code points for encryption 2015-02-22 18:05:13 -08:00
f2fs Merge tag 'for-f2fs-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs 2015-02-12 19:28:50 -08:00
fat fs: fat: use MSDOS_SB macro to get msdos_sb_info 2015-02-17 14:34:51 -08:00
freevxfs
fscache fs/fscache/object-list.c: use __seq_open_private() 2014-10-13 17:52:21 +01:00
fuse fuse: explicitly set /dev/fuse file's private_data 2015-03-19 15:29:22 +01:00
gfs2 VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
hfs fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp 2014-12-10 17:41:16 -08:00
hfsplus VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
hostfs
hpfs
hppfs VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
hugetlbfs fs: remove mapping->backing_dev_info 2015-01-20 14:03:05 -07:00
isofs isofs: Fix bug in the way to check if the year is a leap year 2015-01-07 09:51:49 +01:00
jbd jbd: Deletion of an unnecessary check before the function call "iput" 2014-11-18 10:15:29 +01:00
jbd2 jbd2: complain about descriptor block checksum errors 2015-01-19 15:59:58 -05:00
jffs2 Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-02-22 17:42:14 -08:00
jfs Merge branch 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-02-17 16:12:34 -08:00
kernfs kernfs: handle poll correctly on 'direct_read' files. 2015-03-16 21:51:20 +01:00
lockd Merge branch 'for-3.20' of git://linux-nfs.org/~bfields/linux 2015-02-12 10:39:41 -08:00
logfs
minix
ncpfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-02-17 14:56:45 -08:00
nfs NFSv4.1: Clear the old state by our client id before establishing a new lease 2015-03-03 21:52:30 -05:00
nfs_common lockd: move lockd's grace period handling into its own module 2014-09-17 16:33:11 -04:00
nfsd Subject: nfsd: don't recursively call nfsd4_cb_layout_fail 2015-03-19 15:49:27 -04:00
nilfs2 nilfs2: fix deadlock of segment constructor during recovery 2015-03-12 18:46:08 -07:00
nls
notify fanotify: fix event filtering with FAN_ONDIR set 2015-03-12 18:46:08 -07:00
ntfs fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info 2015-01-20 14:03:04 -07:00
ocfs2 ocfs2: make append_dio an incompat feature 2015-03-12 18:46:07 -07:00
omfs FS/OMFS: block number sanity check during fill_super operation 2014-10-14 02:18:22 +02:00
openpromfs
overlayfs ovl: upper fs should not be R/O 2015-03-18 10:29:48 +01:00
proc pagemap: do not leak physical addresses to non-privileged userspace 2015-03-17 09:31:30 -07:00
pstore pstore: Fix sprintf format specifier in pstore_dump() 2015-01-16 16:01:29 -08:00
qnx4
qnx6
quota Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2015-02-10 15:52:38 -08:00
ramfs fs: remove mapping->backing_dev_info 2015-01-20 14:03:05 -07:00
reiserfs VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
romfs fs: remove mapping->backing_dev_info 2015-01-20 14:03:05 -07:00
squashfs Squashfs: Add LZ4 compression configuration option 2014-11-27 18:48:44 +00:00
sysfs driver core patches for 3.20-rc1 2015-02-15 11:11:47 -08:00
sysv
ubifs Merge branch 'for-linus-v3.20' of git://git.infradead.org/linux-ubifs 2015-02-15 10:11:39 -08:00
udf udf: remove bool assignment to 0/1 2015-02-05 16:34:25 +01:00
ufs fs/ufs/super.c: fix potential race condition 2015-02-17 14:34:51 -08:00
xfs xfs: cancel failed transaction in xfs_fs_commit_blocks() 2015-02-24 10:15:18 +11:00
aio.c fs/aio.c: Remove duplicate function name in pr_debug messages 2015-02-20 04:56:44 -05:00
anon_inodes.c
attr.c
bad_inode.c don't bother with most of the bad_file_ops methods 2015-02-20 04:03:58 -05:00
binfmt_aout.c assorted conversions to %p[dD] 2014-11-19 13:01:20 -05:00
binfmt_elf.c x86, mm/ASLR: Fix stack randomization on 64-bit systems 2015-02-19 12:21:36 +01:00
binfmt_elf_fdpic.c handle suicide on late failure exits in execve() in search_binary_handler() 2014-10-09 02:39:00 -04:00
binfmt_em86.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
binfmt_flat.c
binfmt_misc.c unfuck binfmt_misc.c (broken by commit e6084d4) 2014-12-17 08:27:14 -05:00
binfmt_script.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
block_dev.c Merge branch 'for-3.20/core' of git://git.kernel.dk/linux-block 2015-02-12 14:13:23 -08:00
buffer.c fs: clarify rate limit suppressed buffer I/O errors 2014-10-21 13:55:11 -06:00
char_dev.c fs: introduce f_op->mmap_capabilities for nommu mmap support 2015-01-20 14:02:58 -07:00
compat.c vfs: make first argument of dir_context.actor typed 2014-10-31 17:48:54 -04:00
compat_binfmt_elf.c
compat_ioctl.c
coredump.c coredump: Fix typo in comment 2015-02-20 04:56:44 -05:00
dax.c dax: add dax_zero_page_range 2015-02-16 17:56:04 -08:00
dcache.c VFS: Split DCACHE_FILE_TYPE into regular and special types 2015-02-22 11:38:38 -05:00
dcookies.c
direct-io.c fuse: honour max_read and max_write in direct_io mode 2014-09-26 21:16:51 -04:00
drop_caches.c vmscan: per memory cgroup slab shrinkers 2015-02-12 18:54:09 -08:00
eventfd.c eventfd: don't take the spinlock in eventfd_poll 2015-02-17 14:34:52 -08:00
eventpoll.c epoll: optimize setting task running after blocking 2015-02-13 21:21:40 -08:00
exec.c fs: create proper filename objects using getname_kernel() 2015-01-23 00:22:20 -05:00
fcntl.c vfs: renumber FMODE_NONOTIFY and add to uniqueness check 2015-01-08 15:10:52 -08:00
fhandle.c
file.c fs/file.c: replace get_unused_fd() with get_unused_fd_flags(0) 2014-12-10 17:41:10 -08:00
file_table.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-10-13 11:28:42 +02:00
filesystems.c
fs-writeback.c trylock_super(): replacement for grab_super_passive() 2015-02-22 11:38:42 -05:00
fs_pin.c switch the IO-triggering parts of umount to fs_pin 2015-01-25 23:17:29 -05:00
fs_struct.c
inode.c Merge branch 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-02-17 16:12:34 -08:00
internal.h trylock_super(): replacement for grab_super_passive() 2015-02-22 11:38:42 -05:00
ioctl.c fsioctl.c: make generic_block_fiemap() signal-tolerant 2015-02-10 14:30:30 -08:00
Kconfig dax: does not work correctly with virtual aliasing caches 2015-02-16 17:56:04 -08:00
Kconfig.binfmt fs/binfmt_som: Drop kernel support for HP-UX SOM binaries 2015-02-17 16:29:36 +01:00
libfs.c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
locks.c locks: fix generic_delete_lease tracepoint to use victim pointer 2015-03-14 09:45:35 -04:00
Makefile Merge branch 'parisc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 2015-02-17 14:25:58 -08:00
mbcache.c
mount.h switch the IO-triggering parts of umount to fs_pin 2015-01-25 23:17:29 -05:00
mpage.c vfs: guard end of device for mpage interface 2014-10-09 22:25:53 -04:00
namei.c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
namespace.c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
no-block.c
nsfs.c take the targets of /proc/*/ns/* symlinks to separate fs 2014-12-10 21:30:20 -05:00
open.c Merge branch 'getname2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-02-17 15:27:47 -08:00
pipe.c
pnode.c mnt: Move the clear of MNT_LOCKED from copy_tree to it's callers. 2014-12-02 10:46:50 -06:00
pnode.h
posix_acl.c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
proc_namespace.c vfs: add support for a lazytime mount option 2015-02-05 02:45:00 -05:00
read_write.c Merge branch 'iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-02-17 15:48:33 -08:00
readdir.c vfs: make first argument of dir_context.actor typed 2014-10-31 17:48:54 -04:00
select.c all arches, signal: move restart_block to struct task_struct 2015-02-12 18:54:12 -08:00
seq_file.c bitmap, cpumask, nodemask: remove dedicated formatting functions 2015-02-13 21:21:39 -08:00
signalfd.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
splice.c fs: add vfs_iter_{read,write} helpers 2015-01-29 00:13:13 -05:00
stack.c
stat.c
statfs.c
super.c trylock_super(): replacement for grab_super_passive() 2015-02-22 11:38:42 -05:00
sync.c vfs: add support for a lazytime mount option 2015-02-05 02:45:00 -05:00
timerfd.c fs: Convert show_fdinfo functions to void 2014-11-05 14:13:23 -05:00
utimes.c
xattr.c new helper: audit_file() 2014-11-19 13:01:26 -05:00