Bug#767261: [Pkg-xen-devel] Bug#767261: xen-hypervisor-4.4-amd64: host lockup when DomU network iface is down
Gedalya
2014-11-06 16:20:02 UTC
Control: reassign -1 src:linux
Control: found -1 3.16.5-1
On dom0 I get messages like 'vif vif-10-0 vif10.0: draining TX queue',
starting as soon as the domUs boot up. I'm pretty sure this is a
regression from Xen 4.1 in wheezy.
[...]
dom0 and domU kernel is linux 3.16-3-amd64 3.16.5-1
This is most likely to be a dom0 kernel side issue, so reassigning.
Are there any interesting messages preceding the "draining TX queue"
ones?
No... nothing at all.
I suspect we will need to backport some xen-netback patch or other. I've
put some feelers out to see if any of the upstream devs have any
hints...
OK so if it's just a matter of changing a kernel on one box, I can
perhaps try to build a 3.18 this weekend.
Ian.
--
To UNSUBSCRIBE, email to debian-bugs-dist-***@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact ***@lists.debian.org
Ian Campbell
2014-11-07 08:30:02 UTC
Post by Gedalya
I suspect we will need to backport some xen-netback patch or other. I've
put some feelers out to see if any of the upstream devs have any
hints...
OK so if it's just a matter of changing a kernel on one box, I can
perhaps try to build a 3.18 this weekend
I think these commits, which are in v3.18-rc3, are probably the ones:

ecf08d2 xen-netback: reintroduce guest Rx stall detection
f48da8b xen-netback: fix unlimited guest Rx internal queue and carrier flapping
bc96f64 xen-netback: make feature-rx-notify mandatory

I'll investigate a backport/check if they are destined for ***@.

Ian.
Gedalya
2014-11-08 05:50:02 UTC
Post by Ian Campbell
ecf08d2 xen-netback: reintroduce guest Rx stall detection
f48da8b xen-netback: fix unlimited guest Rx internal queue and carrier flapping
bc96f64 xen-netback: make feature-rx-notify mandatory
Ian.
Tried to just frankenport xen-netback from 3.18 into 3.16, didn't work
very well ;-)
I'm running 3.18rc3+ now. Bombarding the downed interface by
broadcast-pinging the network it's on causes the following:
[ 281.396014] vif vif-3-0 vif3.0: Guest Rx stalled
[ 281.396080] breth1: port 3(vif3.0) entered disabled state
and that's it. This is instead of the previously repeated 'draining TX
queue' messages.
Let's assume it won't crash, I'll let you know if this assumption turns
out to be wrong.

I'm kind of curious why this is preceded by
[ 46.232475] vif vif-3-0 vif3.0: Guest Rx ready
[ 46.232514] IPv6: ADDRCONF(NETDEV_CHANGE): vif3.0: link becomes ready
And the host figures out it's down only when traffic comes and doesn't
get through.
I guess this might change if I run 3.18 in the guest too?
Ian Campbell
2014-11-08 10:50:01 UTC
Post by Gedalya
Tried to just frankenport xen-netback from 3.18 into 3.16, didn't work
very well ;-)
Did you backport just the above or the full set of changes from 3.18?
Post by Gedalya
I'm running 3.18rc3+ now. Bombarding the downed interface by
broadcast-pinging the network it's on causes the following
[ 281.396014] vif vif-3-0 vif3.0: Guest Rx stalled
[ 281.396080] breth1: port 3(vif3.0) entered disabled state
and that's it. This is instead of the previously repeated 'draining TX
queue' messages.
Let's assume it won't crash, I'll let you know if this assumption turns
out to be wrong.
I'm kind of curious why this is preceded by
[ 46.232475] vif vif-3-0 vif3.0: Guest Rx ready
[ 46.232514] IPv6: ADDRCONF(NETDEV_CHANGE): vif3.0: link becomes ready
And the host figures out it's down only when traffic comes and doesn't
get through.
I guess this might change if I run 3.18 in the guest too?
I *think* this is the intended behaviour of "xen-netback: reintroduce
guest Rx stall detection": since the interface is down on the guest side,
it is considered stalled (i.e. not processing any packets).

I think the "link becomes ready" message refers to the backend end of
the connection; it's like a network cable plugged in at only one end.
Perhaps things could be smarter, but that would be an upstream matter.
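
For anyone who wants to watch these transitions across several vifs, a minimal Python sketch along the following lines could pull the netback state messages out of kernel log text. It only recognises the three message strings quoted in this thread; the regular expression, the function name `vif_events` and the short labels are assumptions made for illustration and are not part of xen-netback or any Debian tool.

```python
import re

# Matches kernel log lines such as
#   "[ 281.396014] vif vif-3-0 vif3.0: Guest Rx stalled"
# The pattern is an assumption based on the lines quoted in this thread.
VIF_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+vif\s+vif-\d+-\d+\s+"
    r"(?P<vif>vif\d+\.\d+):\s+(?P<msg>.+)$"
)

# Message text -> short label. The labels are this sketch's own naming.
STATES = {
    "Guest Rx ready": "ready",
    "Guest Rx stalled": "stalled",
    "draining TX queue": "draining",
}


def vif_events(lines):
    """Yield (timestamp, vif, label) for each recognised netback message."""
    for line in lines:
        match = VIF_LINE.match(line.strip())
        if match and match.group("msg") in STATES:
            yield (
                float(match.group("ts")),
                match.group("vif"),
                STATES[match.group("msg")],
            )


# Log lines copied from the messages in this thread.
SAMPLE = [
    "[ 46.232475] vif vif-3-0 vif3.0: Guest Rx ready",
    "[ 46.232514] IPv6: ADDRCONF(NETDEV_CHANGE): vif3.0: link becomes ready",
    "[ 281.396014] vif vif-3-0 vif3.0: Guest Rx stalled",
    "[ 281.396080] breth1: port 3(vif3.0) entered disabled state",
]

if __name__ == "__main__":
    for ts, vif, label in vif_events(SAMPLE):
        print(f"{ts:12.6f} {vif} {label}")
```

Lines from the bridge and IPv6 code are skipped on purpose, so only the backend's own view of the guest Rx state is reported.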
Gedalya
2014-11-08 13:50:02 UTC
Post by Ian Campbell
Did you backport just the above or the full set of changes from 3.18?
I tried to "simplify" (avoid having to edit code myself..) by just
copying the full xen-netback from 3.18 as it is.
I did have to revert "c835a6 net: set name_assign_type in
alloc_netdev()" to get it to compile, but then it gave me a kernel bug
as soon as a xen guest booted up.
(see attached if it matters)
I'll try to apply just those 3 patches and see how it goes.
Post by Ian Campbell
I *think* this is the intended behaviour of "xen-netback: reintroduce
guest Rx stall detection", since the interface is down on the guest side
it becomes considered stalled (i.e. not processing any packets).
The "link becomes ready" message I think refers to the backend end of
the connection, it's like a network cable only plugged in at one end or
something. Perhaps things could be smarter, but that would be an
upstream thing I think.
OK, makes sense. Thanks!
Gedalya
2014-11-08 20:20:01 UTC
Post by Gedalya
I tried to "simplify" (avoid having to edit code myself..) by just
copying the full xen-netback from 3.18 as it is.
I did have to revert "c835a6 net: set name_assign_type in
alloc_netdev()" to get it to compile, but then it gave me a kernel bug
as soon as a xen guest booted up.
(see attached if it matters)
I'll try to apply just those 3 patches and see how it goes.
Important: I have no idea what I'm doing!!

So I cherry-picked the following:
xen-netback: reintroduce guest Rx stall detection
xen-netback: fix unlimited guest Rx internal queue and carrier flapping
xen-netback: make feature-rx-notify mandatory
xen-netback: Don't deschedule NAPI when carrier off
xen-netback: Fix vif->disable handling
xen-netback: Turn off the carrier if the guest is not able to receive
xen-netback: Using a new state bit instead of carrier

I'm attaching the two commits for which I had to manually resolve
conflicts, and finally a Debian quilt patch including all 7 commits for
3.16.7-2.

So far this is working, behavior is as I described for 3.18.

Perhaps this could be helpful but someone should certainly review it.
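
Since only three of the seven commits are identified by hash in this thread, a reviewer might want to look the others up by subject line. The Python sketch below only builds the `git log` command lines for doing so. The revision range `v3.16..v3.18-rc3` and the path `drivers/net/xen-netback` are assumptions about where these commits live in an upstream kernel tree, and `lookup_command` is a hypothetical helper name.

```python
import shlex

# Subject lines of the cherry-picked commits, as listed in this thread.
SUBJECTS = [
    "xen-netback: reintroduce guest Rx stall detection",
    "xen-netback: fix unlimited guest Rx internal queue and carrier flapping",
    "xen-netback: make feature-rx-notify mandatory",
    "xen-netback: Don't deschedule NAPI when carrier off",
    "xen-netback: Fix vif->disable handling",
    "xen-netback: Turn off the carrier if the guest is not able to receive",
    "xen-netback: Using a new state bit instead of carrier",
]


def lookup_command(subject,
                   rev_range="v3.16..v3.18-rc3",      # assumed range
                   path="drivers/net/xen-netback"):   # assumed path
    """Return an argv list for a `git log` search on one subject line."""
    return [
        "git", "log", "--oneline",
        "--fixed-strings",        # treat the subject as a literal string
        f"--grep={subject}",
        rev_range, "--", path,
    ]


if __name__ == "__main__":
    # Print shell-quoted commands to run inside a kernel git checkout.
    for subject in SUBJECTS:
        print(shlex.join(lookup_command(subject)))
```

The script prints the commands instead of running them, so the output can be checked before anything touches a kernel tree.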
Ian Campbell
2014-11-09 10:20:02 UTC
Post by Gedalya
Important: I have no idea what I'm doing!!
:-D
Post by Gedalya
So I cherry-picked the following
xen-netback: reintroduce guest Rx stall detection
xen-netback: fix unlimited guest Rx internal queue and carrier flapping
xen-netback: make feature-rx-notify mandatory
xen-netback: Don't deschedule NAPI when carrier off
xen-netback: Fix vif->disable handling
xen-netback: Turn off the carrier if the guest is not able to receive
xen-netback: Using a new state bit instead of carrier
I'm attaching the two commits for which I had to manually resolve
conflicts, and finally a debian quilt patch including all 7 commits for
3.16.7-2
So far this is working, behavior is as I described for 3.18.
Perhaps this could be helpful but someone should certainly review it.
Thanks, I actually ended up backporting a few more patches, effectively
all of the netback changes since v3.16 since they all looked like useful
fixes, and it reduced the conflicts.

If you were able to test the kernel from
http://xenbits.xen.org/people/ianc/debian/767261/ that would be great
(I'm struggling a bit to regroove my usual test box with something
useful).

Ian.
Gedalya
2014-11-09 10:40:02 UTC
Post by Ian Campbell
If you were able to test the kernel from
http://xenbits.xen.org/people/ianc/debian/767261/ that would be great
(I'm struggling a bit to regroove my usual test box with something
useful).
Ian.
OK, that works. Got it to stall etc.; uptime is 6 minutes and all good so
far. I'll let you know if anything interesting happens.