Null dereference panic in CentOS-6.5

Hi,

I got a panic when running CentOS-6.5:

crash> bt
PID: 106074 TASK: ffff8839c1e32ae0 CPU: 4 COMMAND: "flushd4[cbd-sd-"
#0 [ffff8839c2a91900] machine_kexec at ffffffff81038fa9
#1 [ffff8839c2a91960] crash_kexec at ffffffff810c5992
#2 [ffff8839c2a91a30] oops_end at ffffffff81515c90
#3 [ffff8839c2a91a60] no_context at ffffffff81049f1b
#4 [ffff8839c2a91ab0] __bad_area_nosemaphore at ffffffff8104a1a5
#5 [ffff8839c2a91b00] bad_area_nosemaphore at ffffffff8104a273
#6 [ffff8839c2a91b10] __do_page_fault at ffffffff8104a9bf
#7 [ffff8839c2a91c30] do_page_fault at ffffffff81517bae
#8 [ffff8839c2a91c60] page_fault at ffffffff81514f95
    [exception RIP: rb_next+1]
    RIP: ffffffff81286e21 RSP: ffff8839c2a91d10 RFLAGS: 00010046
    RAX: 0000000000000000 RBX: ffff88204b501c00 RCX: 0000000000000000
    RDX: ffff88013bc56840 RSI: ffff88013bc568d8 RDI: 0000000000000010
    RBP: ffff8839c2a91d60 R8: 0000000000000001 R9: 0000000000000001
    R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff8839c2a91d18] pick_next_task_fair at ffffffff81068121
#10 [ffff8839c2a91d68] schedule at ffffffff81511e08
#11 [ffff8839c2a91e28] flushd_run at ffffffffa07a2cbd [cbd]
#12 [ffff8839c2a91ee8] kthread at ffffffff8109acd6
#13 [ffff8839c2a91f48] kernel_thread at ffffffff8100c20a

[cbd] is a module that we developed; I think this bug has nothing to
do with it.

The contents of rq in pick_next_task(struct rq *rq) are as follows (see
the attachment for the full contents of struct rq):

struct rq {
  lock = {
    raw_lock = {
      slock = 67109881
    }
  },
  nr_running = 2,
  cpu_load = {0, 5923, 14993, 13888, 9115},
  last_load_update_tick = 4365159236,
  nohz_balance_kick = 0 '\000',
  skip_clock_update = 0,
  load = {
    weight = 2,
    inv_weight = 0
  },
  nr_load_updates = 21530842,
  nr_switches = 148355748,
  cfs = {
    load = {
      weight = 2,
      inv_weight = 0
    },
    nr_running = 1,
    h_nr_running = 2,
    exec_clock = 3309310258875,
    min_vruntime = 1181294560093,
    tasks_timeline = {
      rb_node = 0x0
    },
    rb_leftmost = 0x0,
    tasks = {
      next = 0xffff88013bc568e8,
      prev = 0xffff88013bc568e8
    },
    balance_iterator = 0xffff88013bc568e8,
    curr = 0xffff88204b501e00,
    next = 0x0,
    last = 0x0,
    skip = 0x0,
    nr_spread_over = 5,
    ....

We can see that the value of rq->cfs.nr_running is not zero, but
rb_leftmost is NULL. Since skip is also NULL, this causes a NULL
dereference panic in pick_next_entity(), called from pick_next_task_fair().
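
To make that reasoning concrete, here is a rough user-space model in Python, not kernel code. It assumes that struct sched_entity begins with a 16-byte struct load_weight followed by run_node (the layouts below are simplified and hypothetical), and that the CentOS 6.5 pick_next_entity() compares the leftmost entity with skip before calling rb_next(), as upstream kernels with skip buddies do. Under those assumptions, a NULL leftmost entity matches a NULL skip, and rb_next() is handed the address 0x10, which is the RDI value in the register dump above.

```python
import ctypes

# Hypothetical, simplified layouts: only the leading fields matter here.
class LoadWeight(ctypes.Structure):
    _fields_ = [("weight", ctypes.c_uint64),
                ("inv_weight", ctypes.c_uint64)]

class RbNode(ctypes.Structure):
    _fields_ = [("rb_parent_color", ctypes.c_uint64),
                ("rb_right", ctypes.c_uint64),
                ("rb_left", ctypes.c_uint64)]

class SchedEntity(ctypes.Structure):
    _fields_ = [("load", LoadWeight),
                ("run_node", RbNode)]

# Byte offset of run_node inside the modelled sched_entity.
RUN_NODE_OFFSET = SchedEntity.run_node.offset

def rb_next_argument(rb_leftmost, skip):
    """Return the address this model would pass to rb_next(), or None."""
    # Model of picking the first entity: a NULL leftmost node gives a
    # NULL entity, otherwise step back from run_node to the entity start.
    se = rb_leftmost - RUN_NODE_OFFSET if rb_leftmost else 0
    # Assumed skip-buddy check: NULL == NULL passes as well.
    if skip == se:
        # Address of se->run_node, computed even when se is NULL.
        return se + RUN_NODE_OFFSET
    return None

if __name__ == "__main__":
    # Values taken from the dump: rb_leftmost = 0x0, skip = 0x0.
    print(hex(rb_next_argument(rb_leftmost=0x0, skip=0x0)))  # prints 0x10
```

This only shows why the faulting address is consistent with an empty tree plus a NULL skip. It does not explain how nr_running and the tree got out of sync in the first place.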

Has anyone encountered the same problem, or does anyone have any advice?

Thanks

Comments

Re: Null dereference panic in CentOS-6.5

By John Hodrien at 10/18/2017 - 04:41

Expect minimal help when running custom kernel modules on painfully old CentOS
kernels?

jh

Re: Null dereference panic in CentOS-6.5

By wuzhouhui at 10/18/2017 - 04:50

I googled this issue and found that many people have encountered it, but
most of them just said "the newer kernel doesn't have this problem, so
upgrade the kernel". We can't upgrade the kernel easily, so we need to
*really* solve this problem.

On 10/18/2017 04:41 PM, John Hodrien wrote:

Re: Null dereference panic in CentOS-6.5

By Stephen John Smoogen at 10/18/2017 - 10:00

On 18 October 2017 at 04:50, wuzhouhui < ... at mails dot ucas.ac.cn> wrote:
If you can't update the kernel, then how can anyone fix the problem?
The kernel needs to be changed out in some way. [Yes, there are ways to
binary patch a running kernel, but it is a) fraught with danger and b)
an experts-only area. People who do that do not offer their services for
free for a reason.]

Re: Null dereference panic in CentOS-6.5

By wuzhouhui at 10/18/2017 - 10:34

Fine, it seems that upgrading the kernel is the only effective solution.

Re: Null dereference panic in CentOS-6.5

By Rosenthal, Shoshana at 10/18/2017 - 11:17

Please remove me from your email I stopped working
Thanks

Re: Null dereference panic in CentOS-6.5

By James Hogarth at 10/18/2017 - 10:53

On 18 October 2017 at 15:34, wuzhouhui < ... at mails dot ucas.ac.cn> wrote:
To be as abundantly clear as possible on the matter ... it is not just the kernel.

You need to do a full update against the CentOS 6 repositories.