DevHeads.net

high kworker CPU usage in 3.10.0-957 w/ Xorg nouveau driver?

Hi all,

I have a number of GNOME/X desktop workstations with NVIDIA GeForce GT
1030 adapters, dual monitors, quad-core hyper-threaded Core i7-3770
CPUs, and 32GB of RAM. Most of them (I haven't checked them all yet)
have been exhibiting problems since I upgraded from kernel
3.10.0-862.14.4.el7.x86_64 to 3.10.0-957.1.3.el7.x86_64: mouse movement
and typing are noticeably sluggish, and there are screen rendering
problems. The users see this behavior right after logging into GNOME,
even without any additional applications (Chrome, Firefox, LibreOffice,
etc.) running. In top I can see multiple kworker processes consuming a
large amount of CPU time, and the load averages are unusually high: the
5-minute average sits in the 5-7 range, whereas a normal load average
for these users is 1-2. At one point, while troubleshooting with a
user, I was logged in remotely while he was working at the desktop, and
it became completely unresponsive. /var/log/messages had nouveau
messages like:

kernel: nouveau: evo channel stalled
kernel: nouveau 0000:01:00.0: disp: chid 1 mthd 0000 data 00000000 10003000 00000000
kernel: nouveau 0000:01:00.0: DRM: base-1: timeout
kernel: nouveau 0000:01:00.0: DRM: core notifier timeout

Those messages might be meaningless, but they are abundant in the
logs. Just to see what would happen, before rebooting I tried stopping
and starting GDM. Both operations appeared to succeed, and I verified
that all processes owned by the user were gone. I then asked him to log
in again, but he reported that his screens still looked the same as
before I restarted GDM and that there was no login screen.
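To put a number on "abundant", a tally of the nouveau lines in /var/log/messages could be attached to a follow-up. The following is only a minimal Python 3 sketch, assuming the standard syslog line format shown above; count_nouveau_messages and main are hypothetical helper names, not part of any existing tool. It strips the timestamp, hostname, and PCI address so that repeats of the same message are grouped together.

```python
import re
import sys
from collections import Counter

# Assumed syslog format: "... kernel: nouveau[ <PCI address>]: <message>".
# The optional PCI address (e.g. 0000:01:00.0) is dropped so identical
# messages from the same adapter are counted as one entry.
NOUVEAU_RE = re.compile(r"kernel: nouveau(?: [0-9a-f:.]+)?: (.*)$")


def count_nouveau_messages(lines):
    """Tally nouveau kernel messages by their text, ignoring prefixes."""
    counts = Counter()
    for line in lines:
        match = NOUVEAU_RE.search(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


def main(path="/var/log/messages"):
    """Print the ten most frequent nouveau messages found in the log."""
    try:
        with open(path, errors="replace") as log:
            counts = count_nouveau_messages(log)
    except OSError as exc:
        # Reading /var/log/messages usually requires root.
        print(f"cannot read {path}: {exc}", file=sys.stderr)
        return
    for message, n in counts.most_common(10):
        print(f"{n:6d}  {message}")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages")
```

Run as root (for example `python3 count_nouveau.py /var/log/messages`, where the script name is arbitrary), it prints one line per distinct message with its count. Comparing the output from a 3.10.0-957 boot with the output from a 3.10.0-862 boot would show whether the timeouts only appear on the newer kernel.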

Users are currently booting their systems into the 3.10.0-862 kernel,
and the problem does not occur there. I should also add that running
the proprietary NVIDIA driver version 410.78 (from nvidia.com, not
ELRepo) does not produce this problem either. I manage the
configuration of all these desktops with Puppet, and they were all
built from the same kickstart file. The NVIDIA driver is not
intentionally managed by Puppet; I just happened to be experimenting
with it on my workstation.

Before I load the proprietary driver on all the problematic systems, I
was hoping someone on the list might have some insight or suggestions.

Thanks!

--Sean