Discussion:
A user process blocks hibernation - what?
Dave Howorth
2020-03-20 19:55:48 UTC
On Fri, 20 Mar 2020 19:58:44 +0100
On Fri, 20 Mar 2020 13:21:17 +0100
The user run application 'mc' (Midnight Commander) was blocking
hibernation (and now I see there was another application named
'pool').
I tried to kill 'mc' with killall -9. It still refused. I
killed the terminal that had it, no way. In the end, I had to
power off the machine instead.
I think that mc was blocked because it had a remote directory open
and that other machine had been hibernated a minute before.
How can it be that a plebeian app stops the almighty kernel in
its tracks?
Both threads are in kernel mode and, as you yourself said, cannot
be interrupted. So there is little the kernel can do.
Sorry, I still do not understand why a user process such as mc
cannot be destroyed on order. No excuses.
A user process can enter kernel mode - this one did, and then
disabled interrupts. I.e. it has to complete.
Disabled interrupts? But all the processes were working, only this
one was stuck. My training said that when interrupts were disabled,
no one got access to them.
I don't understand a word of that?
You have one process which disabled interrupts whilst in some bit of
kernel code, maybe a driver, who knows. Disabling interrupts just
means a bit of code that must complete without any asynchronous calls
happening. Most probably to guarantee data integrity. It's perfectly
normal.
Right, but kernel code that suspends interrupts is not supposed to
persist indefinitely and should have been QAd by kernel devs, no?

Plus, as Carlos says, since when has a network connection disappearing
been unexpected, or had any effect on data integrity?
--
To unsubscribe, e-mail: opensuse+***@opensuse.org
To contact the owner, e-mail: opensuse+***@opensuse.org
Carlos E. R.
2020-03-21 13:18:14 UTC
Interrupts are blocked per process. Very basic example: I might
have a daemon I want to ignore Ctrl-C (SIGINT), so I block it. Does
not mean any other process should also ignore it.
That's totally news to me. :-o
But Ctrl-C is a signal issued by the kernel, not a hardware
interrupt that some code has to handle in order to read the hardware
keyboard interface. I'm talking of things like INT 01, INT 02, etc.
Some hardware puts a high voltage on a line, and the CPU halts
completely and jumps to a predefined address. Hardware.
Which has no bearing on the problem you have described. You said "I
tried to kill 'mc' with killall -9. It still refused. " - that means
SIGKILL has been disabled (only possible in kernel mode).
It is impossible to block interrupts for minutes
It is entirely possible to block interrupts for minutes.
And then the entire kernel goes kaput.
Uh, no. It just waits - as you have found out.
Not on hardware interrupts, which is what I was thinking about.

- --
Cheers
Carlos E. R.

(from openSUSE 15.1 (Legolas))
Per Jessen
2020-03-21 08:41:01 UTC
Post by Dave Howorth
On Fri, 20 Mar 2020 19:58:44 +0100
On Fri, 20 Mar 2020 13:21:17 +0100
Sorry, I still do not understand why a user process such as mc
can not be destroyed on order. No excuses.
A user process can enter kernel mode - this one did, and then
disabled interrupts. I.e. it has to complete.
Disabled interrupts? But all the processes were working, only this
one was stuck. My training said that when interrupts were disabled,
no one got access to them.
I don't understand a word of that?
When one process blocks interrupts, my teachers told me it applied to
the entire computer, all processes. Nothing, not even the kernel, can
intervene. The keyboard gets blocked, the clock interrupts get
blocked.
Interrupts are blocked per process. Very basic example: I might have a
daemon I want to ignore Ctrl-C (SIGINT), so I block it. Does not mean
any other process should also ignore it.
It is impossible to block interrupts for minutes
It is entirely possible to block interrupts for minutes.
I also do not understand how a user process can block interrupts; that
should be reserved for the kernel.
Because it was in kernel mode, probably in a kernel driver. In your
case, maybe some filesystem code.

A user process does not always remain in user mode, it needs services
from the kernel, for instance to do I/O.
--
Per Jessen, Zürich (10.7°C)
http://www.hostsuisse.com/ - virtual servers, made in Switzerland.
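The per-process blocking described above applies to signals such as SIGINT, not to hardware interrupts. A minimal sketch of that idea in Python 3.8+ follows; the function names are made up for illustration, and nothing here is taken from mc or the kernel. It blocks SIGINT in the calling thread, raises it, and checks that it stays pending instead of being delivered. It also checks that a user-space request to block SIGKILL is silently ignored, so an unkillable process is not something a program can arrange through its own signal mask.

```python
import signal


def sigint_stays_pending_while_blocked():
    """Block SIGINT, raise it, and report whether it is held pending."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        # The signal is generated now but not delivered while it is blocked.
        signal.raise_signal(signal.SIGINT)
        was_pending = signal.SIGINT in signal.sigpending()
        # Consume the pending SIGINT so it does not fire when we unblock.
        signal.sigwait({signal.SIGINT})
        return was_pending
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


def sigkill_can_be_blocked():
    """Try to add SIGKILL to the signal mask and report whether it stuck."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGKILL})
    try:
        # Passing an empty set just returns the current mask.
        current = signal.pthread_sigmask(signal.SIG_BLOCK, set())
        return signal.SIGKILL in current
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Other processes are unaffected by this mask, which matches the observation that only one process was stuck while everything else kept running.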
Carlos E. R.
2020-03-21 12:47:09 UTC
Post by Per Jessen
A user process can enter kernel mode - this one did, and then
disabled interrupts. I.e. it has to complete.
Disabled interrupts? But all the processes were working, only this
one was stuck. My training said that when interrupts were disabled,
no one got access to them.
I don't understand a word of that?
When one process blocks interrupts, my teachers told me it applied to
the entire computer, all processes. Nothing, not even the kernel, can
intervene. The keyboard gets blocked, the clock interrupts get
blocked.
Interrupts are blocked per process. Very basic example: I might have a
daemon I want to ignore Ctrl-C (SIGINT), so I block it. Does not mean
any other process should also ignore it.
That's totally news to me. :-o

But Ctrl-C is a signal issued by the kernel, not a hardware interrupt
that some code has to handle in order to read the hardware keyboard
interface. I'm talking of things like INT 01, INT 02, etc. Some hardware
puts a high voltage on a line, and the CPU halts completely and jumps to a
predefined address. Hardware.
Post by Per Jessen
It is impossible to block interrupts for minutes
It is entirely possible to block interrupts for minutes.
And then the entire kernel goes kaput.

- --
Cheers
Carlos E. R.

(from openSUSE 15.1 (Legolas))
Per Jessen
2020-03-21 13:03:01 UTC
Interrupts are blocked per process. Very basic example: I might
have a daemon I want to ignore Ctrl-C (SIGINT), so I block it. Does
not mean any other process should also ignore it.
That's totally news to me. :-o
But Ctrl-C is a signal issued by the kernel, not a hardware
interrupt that some code has to handle in order to read the hardware
keyboard interface. I'm talking of things like INT 01, INT 02, etc.
Some hardware puts a high voltage on a line, and the CPU halts
completely and jumps to a predefined address. Hardware.
Which has no bearing on the problem you have described. You said "I
tried to kill 'mc' with killall -9. It still refused. " - that means
SIGKILL has been disabled (only possible in kernel mode).
It is impossible to block interrupts for minutes
It is entirely possible to block interrupts for minutes.
And then the entire kernel goes kaput.
Uh, no. It just waits - as you have found out.
--
Per Jessen, Zürich (10.9°C)
http://www.cloudsuisse.com/ - your owncloud, hosted in Switzerland.
Dave Howorth
2020-03-21 14:41:49 UTC
On Sat, 21 Mar 2020 12:37:23 +0100
On Sat, 21 Mar 2020 09:28:41 +0100
Post by Dave Howorth
You have one process which disabled interrupts whilst in some
bit of kernel code, maybe a driver, who knows. Disabling
interrupts just means a bit of code that must complete without
any asynchronous calls happening. Most probably to guarantee
data integrity. It's perfectly normal.
Right, but kernel code that suspends interrupts is not supposed
to persist indefinitely and should have been QAd by kernel devs,
no?
No and maybe, in that order :-)
It _is_ supposed to suspend indefinitely, but usually not for very
long. (in the order of microseconds probably). Yes, it probably
has been QAed and shown to work fine.
Ah, I think I understand. When the term 'interrupt' is used, Carlos
and I think of a hardware capability. I gather you're thinking of an
emulated software capability.
Yes, I'm looking at it as being "sat" inside a process. Hardware
interrupts are usually not serviced by a process (kernel or user), but
by an interrupt handler which then queues whatever it is (for
processing). (I'm not sure how HPET interrupts are handled though).
Carlos' 'midnight commander' is just a process, accessing the fuse
filesystem that is mounted with sshfs. As it has disabled SIGKILL, it
must be in kernel mode. I think disabling SIGKILL can only be
interpreted to mean "this _must_ complete, to avoid corrupting data".
OK, I think the difficulty we've had is that you've been using the word
'interrupt' when you should have been using the word 'signal'.

That's the correct word according to
https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html
where it also notes:

"In fact, if SIGKILL fails to terminate a process, that by itself
constitutes an operating system bug which you should report."

So I think Carlos should open a bugzilla.
Post by Dave Howorth
Plus as Carlos says, since when has a network connection
disappearing been unexpected and have any effect on data
integrity?
A network filesystem mount ?
I have a number of systems running with root on NFS, root is always
mounted with "hard,intr". That means "wait forever" in the case of
loss of the connection.
But in that case the mount is not done by a user program (mc in
Carlos' case) via FUSE
A FUSE driver also has to use kernel services.
Going back to the very first post, I think the situation could have
been remedied by resuming the machine at 192.168.1.134. Now Carlos'
'mc' would have been able to complete the "must complete" code and
exit cleanly.
Yes, but that's the wrong answer. It might be that the remote system
broke or was destroyed, for example, so it cannot be restored. And it's
not what Carlos wants anyway. He wants his system to hibernate. And
specifically he wants to be able to kill the mc process. Maybe he's
assessed any data integrity issues and decided he doesn't care, or at
least that it's the least worst option.
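One common reason a process on Linux does not react to kill -9 is that it is in uninterruptible sleep (state "D" in ps): the SIGKILL stays pending until the kernel call it is waiting in returns. The thread does not confirm that this is what happened to mc, so the following is only a diagnostic sketch. It reads the state field from /proc/<pid>/stat; the function names are made up for illustration.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split on the first "(" and the last ")".
    """
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # the process exited or its entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return found
```

Running uninterruptible_processes() just before asking for hibernation would show whether something like mc is parked in that state, and the result could be attached to a bug report.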
Carlos E. R.
2020-03-21 19:38:05 UTC
Post by Dave Howorth
On Sat, 21 Mar 2020 12:37:23 +0100
On Sat, 21 Mar 2020 09:28:41 +0100
...
Post by Dave Howorth
Ah, I think I understand. When the term 'interrupt' is used, Carlos
and I think of a hardware capability. I gather you're thinking of an
emulated software capability.
Yes, I'm looking at it as being "sat" inside a process. Hardware
interrupts are usually not serviced by a process (kernel or user), but
by an interrupt handler which then queues whatever it is (for
processing). (I'm not sure how HPET interrupts are handled though).
Carlos' 'midnight commander' is just a process, accessing the fuse
filesystem that is mounted with sshfs. As it has disabled SIGKILL, it
must be in kernel mode. I think disabling SIGKILL can only be
interpreted to mean "this _must_ complete, to avoid corrupting data".
OK, I think the difficulty we've had is that you've been using the word
'interrupt' when you should have been using the word 'signal'.
Yes, when I think of an interrupt I think of the pin on the CPU with that name.
With variations: a single one, or one normal and another that cannot be
masked, or numbered interrupts signalled by writing a number on some bus,
dedicated or not, at the same time as or after raising the IRQ line. Not of the
strange concept that Microsoft used in MS-DOS with numbered software interrupts,
with support from the CPU. It could have been called a predefined subroutine
table or something. It confuses the hell out of me, sorry.
Post by Dave Howorth
That's the correct word according to
https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html
"In fact, if SIGKILL fails to terminate a process, that by itself
constitutes an operating system bug which you should report."
So I think Carlos should open a bugzilla.
Ok, will do, thanks, if the log survived. The machine is being migrated,
so the log may be in the new or the old machine, dunno. I can't access it
now.
Post by Dave Howorth
Post by Dave Howorth
Plus as Carlos says, since when has a network connection
disappearing been unexpected and have any effect on data
integrity?
A network filesystem mount ?
I have a number of systems running with root on NFS, root is always
mounted with "hard,intr". That means "wait forever" in the case of
loss of the connection.
But in that case the mount is not done by a user program (mc in
Carlos' case) via FUSE
A FUSE driver also has to use kernel services.
Going back to the very first post, I think the situation could have
been remedied by resuming the machine at 192.168.1.134. Now Carlos'
'mc' would have been able to complete the "must complete" code and
exit cleanly.
Yes, but that's the wrong answer. It might have been the remote system
broke or was destroyed, for example, so it cannot be restored. And it's
not what Carlos wants anyway. He wants his system to hibernate. And
specifically he wants to be able to kill the mc process. Maybe he's
assessed any data integrity issues and decided he doesn't care, or at
least that it's the least worst option.
There is no filesystem data integrity issue, from my point of view. The
terminal where mc was "running" had not been used for hours. There was no
activity.


The use case is simply that I was going to bed. I was sleepy already, and not
in the mood to fight a computer that refused to hibernate 3 times in a
row while I was getting cold in my pajamas. So I issued the command on both
machines at as nearly the same time as typing it on both allows. The new
machine is faster and I typed there first anyway, so it went down fast.

Meaning, at that time I'm not going to try to remember what network
connections I may have open. In fact, it is very possible there are ssh
sessions in either direction. I never care about them; unless I want the
history to be saved, I just hibernate. The next day the sessions are duly
dead.


Consider a laptop and closing the lid. Would it be acceptable for it not
to go to sleep immediately, and to run the battery out? The kernel has to
suspend the machine no matter what, no excuses accepted. What if the
laptop goes into the backpack and then catches fire? I'm not imagining
things, it has happened, albeit with Windows in the cases I heard of.

It is not acceptable that a machine does not hibernate on order.


- --
Cheers
Carlos E. R.

(from openSUSE 15.1 (Legolas))
Dave Howorth
2020-03-21 20:21:45 UTC
On Sat, 21 Mar 2020 20:38:05 +0100 (CET)
It is not acceptable that a machine does not hibernate on order.
Exactly so.
Per Jessen
2020-03-21 14:52:14 UTC
Post by Dave Howorth
On Sat, 21 Mar 2020 12:37:23 +0100
On Sat, 21 Mar 2020 09:28:41 +0100
Post by Dave Howorth
You have one process which disabled interrupts whilst in some
bit of kernel code, maybe a driver, who knows. Disabling
interrupts just means a bit of code that must complete without
any asynchronous calls happening. Most probably to guarantee
data integrity. It's perfectly normal.
Right, but kernel code that suspends interrupts is not supposed
to persist indefinitely and should have been QAd by kernel devs,
no?
No and maybe, in that order :-)
It _is_ supposed to suspend indefinitely, but usually not for very
long. (in the order of microseconds probably). Yes, it probably
has been QAed and shown to work fine.
Ah, I think I understand. When the term 'interrupt' is used, Carlos
and I think of a hardware capability. I gather you're thinking of
an emulated software capability.
Yes, I'm looking at it as being "sat" inside a process. Hardware
interrupts are usually not serviced by a process (kernel or user),
but by an interrupt handler which then queues whatever it is (for
processing). (I'm not sure how HPET interrupts are handled though).
Carlos' 'midnight commander' is just a process, accessing the fuse
filesystem that is mounted with sshfs. As it has disabled SIGKILL, it
must be in kernel mode. I think disabling SIGKILL can only be
interpreted to mean "this _must_ complete, to avoid corrupting data".
OK, I think the difficulty we've had is that you've been using the
word 'interrupt' when you should have been using the word 'signal'.
I guess I tend to think of signals causing interrupts, i.e. asynchronous
execution of code. Signals can be blocked. Yes, I use the two words
interchangeably, mea culpa.
--
Per Jessen, Zürich (7.8°C)
http://www.dns24.ch/ - free dynamic DNS, made in Switzerland.
Dave Howorth
2020-03-20 15:25:32 UTC
On Fri, 20 Mar 2020 13:21:17 +0100
The user run application 'mc' (Midnight Commander) was blocking
hibernation (and now I see there was another application named
'pool').
I tried to kill 'mc' with killall -9. It still refused. I killed
the terminal that had it, no way. In the end, I had to poweroff
the machine instead.
I think that mc was blocked because it had open a remote directory
and that other machine had been hibernated a minute before.
How can it be that a plebeian app stops the almighty kernel in its
tracks?
Both threads are in kernel mode and as you yourself said cannot be
interrupted. So there is little kernel can do.
Sorry, I still do not understand why a user process such as mc can
not be destroyed on order. No excuses.
A user process can enter kernel mode - this one did, and then disabled
interrupts. I.e. it has to complete.
It sounds like a pretty severe bug to enter kernel mode, disable
interrupts and then wait for some network event? And indeed a problem
in the overall system architecture if it permits of such bugs!

Or am I missing something?
Per Jessen
2020-03-20 18:58:44 UTC
On Fri, 20 Mar 2020 13:21:17 +0100
The user run application 'mc' (Midnight Commander) was blocking
hibernation (and now I see there was another application named
'pool').
I tried to kill 'mc' with killall -9. It still refused. I killed
the terminal that had it, no way. In the end, I had to poweroff
the machine instead.
I think that mc was blocked because it had a remote directory open
and that other machine had been hibernated a minute before.
How can it be that a plebeian app stops the almighty kernel in
its tracks?
Both threads are in kernel mode and as you yourself said cannot be
interrupted. So there is little kernel can do.
Sorry, I still do not understand why a user process such as mc can
not be destroyed on order. No excuses.
A user process can enter kernel mode - this one did, and then
disabled interrupts. I.e. it has to complete.
Disabled interrupts? But all the processes were working, only this one
was stuck. My training said that when interrupts were disabled, no one
got access to them.
I don't understand a word of that?

You have one process which disabled interrupts whilst in some bit of
kernel code, maybe a driver, who knows. Disabling interrupts just
means a bit of code that must complete without any asynchronous calls
happening. Most probably to guarantee data integrity. It's perfectly
normal.
--
Per Jessen, Zürich (13.4°C)
http://www.hostsuisse.com/ - virtual servers, made in Switzerland.
Carlos E. R.
2020-03-20 18:26:43 UTC
On Fri, 20 Mar 2020 13:21:17 +0100
The user run application 'mc' (Midnight Commander) was blocking
hibernation (and now I see there was another application named
'pool').
I tried to kill 'mc' with killall -9. It still refused. I killed
the terminal that had it, no way. In the end, I had to poweroff
the machine instead.
I think that mc was blocked because it had open a remote directory
and that other machine had been hibernated a minute before.
How can it be that a plebeian app stops the almighty kernel in its
tracks?
Both threads are in kernel mode and as you yourself said cannot be
interrupted. So there is little kernel can do.
Sorry, I still do not understand why a user process such as mc can
not be destroyed on order. No excuses.
A user process can enter kernel mode - this one did, and then disabled
interrupts. I.e. it has to complete.
Disabled interrupts? But all the processes were working, only this one
was stuck. My training said that when interrupts were disabled, no one
got access to them.
It sounds like a pretty severe bug to enter kernel mode, disable
interrupts and then wait for some network event? And indeed a problem
in the overall system architecture if it permits of such bugs!
Or am I missing something?
Same here. I still think the kernel should have control of everything,
destroy all resources assigned to the process, and of course, keep the
power to hibernate. If a process can't, ask the user. Ok, f**k that process.


And then there was the issue that apparently caused this: mc had a
directory open that happened to be a remote directory. The other
machine had hibernated and thus disappeared from the network. Why
should that leave mc stuck and unkillable? It is a normal occurrence
for another computer to disappear.

The local process should still respond, even if the remote machine is
gone and not responding.
--
Cheers / Saludos,

Carlos E. R.
(from 15.1 x86_64 at Telcontar)
Carlos E. R.
2020-03-21 19:17:41 UTC
...
Ah, I think I understand. When the term 'interrupt' is used, Carlos
and I think of a hardware capability. I gather you're thinking of an
emulated software capability.
Yes, I'm looking at it as being "sat" inside a process. Hardware
interrupts are usually not serviced by a process (kernel or user),
but by an interrupt handler which then queues whatever it is (for
processing). (I'm not sure how HPET interrupts are handled though).
Carlos' 'midnight commander' is just a process, accessing the fuse
filesystem that is mounted with sshfs. As it has disabled SIGKILL, it
must be in kernel mode. I think disabling SIGKILL can only be
interpreted to mean "this _must_ complete, to avoid corrupting data".
If the connection dies, it dies. So end whatever it is doing, no
matter what; there is no recovering.
You're guessing. What you describe works perfectly fine with NFS, for
instance.
IF the other machine goes up again. I was not going to restore that
machine.
Going back to the very first post, I think the situation could have
been remedied by resuming the machine at 192.168.1.134. Now
Carlos' 'mc' would have been able to complete the "must complete"
code and exit cleanly.
That's a terrible solution. In this case, I might have done it. What
if the other machine is remote?
Unless you have a way of waking it up remotely, don't hibernate it.
Power failure, reboot, maintenance... things happen.

The other machine is manned, and its owner decides to hibernate it at that
moment, after hours of doing nothing and idling. Those are excuses; a
process has to cope with network failure.

- --
Cheers
Carlos E. R.

(from openSUSE 15.1 (Legolas))
Per Jessen
2020-03-21 08:28:41 UTC
Post by Dave Howorth
On Fri, 20 Mar 2020 19:58:44 +0100
Sorry, I still do not understand why a user process such as mc
can not be destroyed on order. No excuses.
A user process can enter kernel mode - this one did, and then
disabled interrupts. I.e. it has to complete.
Disabled interrupts? But all the processes were working, only this
one was stuck. My training said that when interrupts were disabled,
no one got access to them.
I don't understand a word of that?
You have one process which disabled interrupts whilst in some bit of
kernel code, maybe a driver, who knows. Disabling interrupts just
means a bit of code that must complete without any asynchronous calls
happening. Most probably to guarantee data integrity. It's
perfectly normal.
Right, but kernel code that suspends interrupts is not supposed to
persist indefinitely and should have been QAd by kernel devs, no?
No and maybe, in that order :-)

It _is_ supposed to suspend indefinitely, but usually not for very long
(on the order of microseconds, probably). Yes, it probably has been
QAed and shown to work fine.
Post by Dave Howorth
Plus as Carlos says, since when has a network connection disappearing
been unexpected and have any effect on data integrity?
A network filesystem mount ?

I have a number of systems running with root on NFS, root is always
mounted with "hard,intr". That means "wait forever" in the case of
loss of the connection.
--
Per Jessen, Zürich (10.6°C)
http://www.dns24.ch/ - free dynamic DNS, made in Switzerland.
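Since the discussion turns on mounts that wait for a remote machine, it can help to list them before hibernating either end. The sketch below takes text in the format of /proc/mounts and picks out network-backed filesystems, noting whether the "hard" option is present. The list of filesystem types is an assumption and not exhaustive, the function name is made up, and the sample lines in the usage note are invented apart from the address 192.168.1.134 mentioned in this thread.

```python
# Filesystem type prefixes treated as "remote" here (an assumed, partial list).
NETWORK_FS_PREFIXES = ("nfs", "cifs", "fuse.sshfs")


def network_mounts(mounts_text):
    """Return (mount_point, fs_type, is_hard) for each network mount.

    mounts_text uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type.startswith(NETWORK_FS_PREFIXES):
            is_hard = "hard" in options.split(",")
            result.append((mount_point, fs_type, is_hard))
    return result


# Typical use on a Linux system:
#     with open("/proc/mounts") as fh:
#         for mount in network_mounts(fh.read()):
#             print(mount)
```

For example, given the lines "server:/export /mnt/nfs nfs4 rw,hard,proto=tcp 0 0" and "user@192.168.1.134:/home /mnt/remote fuse.sshfs rw,nosuid,nodev 0 0", the function reports both mounts and flags only the first as hard.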
Per Jessen
2020-03-21 11:37:23 UTC
On Sat, 21 Mar 2020 09:28:41 +0100
Post by Dave Howorth
You have one process which disabled interrupts whilst in some bit
of kernel code, maybe a driver, who knows. Disabling interrupts
just means a bit of code that must complete without any
asynchronous calls happening. Most probably to guarantee data
integrity. It's perfectly normal.
Right, but kernel code that suspends interrupts is not supposed to
persist indefinitely and should have been QAd by kernel devs, no?
No and maybe, in that order :-)
It _is_ supposed to suspend indefinitely, but usually not for very
long. (in the order of microseconds probably). Yes, it probably has
been QAed and shown to work fine.
Ah, I think I understand. When the term 'interrupt' is used, Carlos
and I think of a hardware capability. I gather you're thinking of an
emulated software capability.
Yes, I'm looking at it as being "sat" inside a process. Hardware
interrupts are usually not serviced by a process (kernel or user), but
by an interrupt handler which then queues whatever it is (for
processing). (I'm not sure how HPET interrupts are handled though).

Carlos' 'midnight commander' is just a process, accessing the fuse
filesystem that is mounted with sshfs. As it has disabled SIGKILL, it
must be in kernel mode. I think disabling SIGKILL can only be
interpreted to mean "this _must_ complete, to avoid corrupting data".
Post by Dave Howorth
Plus as Carlos says, since when has a network connection
disappearing been unexpected and have any effect on data
integrity?
A network filesystem mount ?
I have a number of systems running with root on NFS, root is always
mounted with "hard,intr". That means "wait forever" in the case of
loss of the connection.
But in that case the mount is not done by a user program (mc in
Carlos' case) via FUSE
A FUSE driver also has to use kernel services.

Going back to the very first post, I think the situation could have been
remedied by resuming the machine at 192.168.1.134. Then Carlos' 'mc'
would have been able to complete the "must complete" code and exit
cleanly.
--
Per Jessen, Zürich (10.1°C)
http://www.hostsuisse.com/ - virtual servers, made in Switzerland.
Carlos E. R.
2020-03-21 13:00:16 UTC
...
Ah, I think I understand. When the term 'interrupt' is used, Carlos
and I think of a hardware capability. I gather you're thinking of an
emulated software capability.
Yes, I'm looking at it as being "sat" inside a process. Hardware
interrupts are usually not serviced by a process (kernel or user), but
by an interrupt handler which then queues whatever it is (for
processing). (I'm not sure how HPET interrupts are handled though).
Carlos' 'midnight commander' is just a process, accessing the fuse
filesystem that is mounted with sshfs. As it has disabled SIGKILL, it
must be in kernel mode. I think disabling SIGKILL can only be
interpreted to mean "this _must_ complete, to avoid corrupting data".
If the connection dies, it dies. So end whatever it is doing, no matter
what; there is no recovering.

And in fact, it was doing nothing, that terminal had not been used in
hours.
Post by Dave Howorth
Plus as Carlos says, since when has a network connection
disappearing been unexpected and have any effect on data
integrity?
A network filesystem mount ?
I have a number of systems running with root on NFS, root is always
mounted with "hard,intr". That means "wait forever" in the case of
loss of the connection.
But in that case the mount is not done by a user program (mc in
Carlos' case) via FUSE
A FUSE driver also has to use kernel services.
Going back to the very first post, I think the situation could have been
remedied by resuming the machine at 192.168.1.134. Now Carlos' 'mc'
would have been able to complete the "must complete" code and exit
cleanly.
That's a terrible solution. In this case, I might have done it. What if
the other machine is remote?

But what if it is a laptop with a dying battery? If it refuses to
hibernate the battery goes and all data in all processes is lost, which is
much worse than a single mc process not exiting cleanly.

I see no excuses for not hibernating no matter what.

Poweroff succeeded fast; it found no excuses not to power off. But of
course, all possible data in everything is lost.


--
Cheers
Carlos E. R.

(from openSUSE 15.1 (Legolas))
Carlos E. R.
2020-03-21 12:53:01 UTC
Post by Dave Howorth
On Fri, 20 Mar 2020 19:58:44 +0100
I don't understand a word of that ?
You have one process which disabled interrupts whilst in some bit
of kernel code, maybe a driver, who knows. Disabling interrupts
just means a bit of code that must complete without any
asynchronous calls happening. Most probably to guarantee data
integrity. It's perfectly normal.
Right, but kernel code that suspends interrupts is not supposed to
persist indefinitely and should have been QAd by kernel devs, no?
No and maybe, in that order :-)
It _is_ supposed to suspend indefinitely, but usually not for very
long (on the order of microseconds, probably). Yes, it probably has
been QAed and shown to work fine.
Ah, I think I understand. When the term 'interrupt' is used, Carlos and
I think of a hardware capability. I gather you're thinking of an
emulated software capability.
Indeed I think of hardware. I'm basically a hardware guy, my training is
in electronics.
Post by Dave Howorth
Plus as Carlos says, since when has a network connection
disappearing been unexpected and have any effect on data
integrity?
A network filesystem mount ?
I have a number of systems running with root on NFS, root is always
mounted with "hard,intr". That means "wait forever" in the case of
loss of the connection.
But in that case the mount is not done by a user program (mc in Carlos'
case) via FUSE
The "mount" was done outside of 'mc' using sshfs, because mc's internal
method has been broken for years. Still, it is FUSE and userland.

Had I thought of it, I might have killed the sshfs process instead.

--
Cheers
Carlos E. R.

(from openSUSE 15.1 (Legolas))
Dave Howorth
2020-03-21 11:15:15 UTC
On Sat, 21 Mar 2020 09:28:41 +0100
Post by Dave Howorth
On Fri, 20 Mar 2020 19:58:44 +0100
Sorry, I still do not understand why a user process such as mc
can not be destroyed on order. No excuses.
A user process can enter kernel mode - this one did, and then
disabled interrupts. I.e. it has to complete.
Disabled interrupts? But all the processes were working, only
this one was stuck. My training said that when interrupts were
disabled, no one got access to them.
I don't understand a word of that ?
You have one process which disabled interrupts whilst in some bit
of kernel code, maybe a driver, who knows. Disabling interrupts
just means a bit of code that must complete without any
asynchronous calls happening. Most probably to guarantee data
integrity. It's perfectly normal.
Right, but kernel code that suspends interrupts is not supposed to
persist indefinitely and should have been QAd by kernel devs, no?
No and maybe, in that order :-)
It _is_ supposed to suspend indefinitely, but usually not for very
long (on the order of microseconds, probably). Yes, it probably has
been QAed and shown to work fine.
Ah, I think I understand. When the term 'interrupt' is used, Carlos and
I think of a hardware capability. I gather you're thinking of an
emulated software capability.
Post by Dave Howorth
Plus as Carlos says, since when has a network connection
disappearing been unexpected and have any effect on data
integrity?
A network filesystem mount ?
I have a number of systems running with root on NFS, root is always
mounted with "hard,intr". That means "wait forever" in the case of
loss of the connection.
But in that case the mount is not done by a user program (mc in Carlos'
case) via FUSE
Carlos E. R.
2020-03-20 20:05:58 UTC
Post by Dave Howorth
On Fri, 20 Mar 2020 19:58:44 +0100
On Fri, 20 Mar 2020 13:21:17 +0100
The user run application 'mc' (Midnight Commander) was blocking
hibernation (and now I see there was another application named
'pool').
I tried to kill 'mc' with killall -9. It still refused. I
killed the terminal that had it, no way. In the end, I had to
poweroff the machine instead.
I think that mc was blocked because it had open a remote
and that other machine had been hibernated a minute before.
How can it be that a plebeian app stops the almighty kernel in
its tracks?
Both threads are in kernel mode and as you yourself said cannot
be interrupted. So there is little kernel can do.
Sorry, I still do not understand why a user process such as mc
can not be destroyed on order. No excuses.
A user process can enter kernel mode - this one did, and then
disabled interrupts. I.e. it has to complete.
Disabled interrupts? But all the processes were working, only this
one was stuck. My training said that when interrupts were disabled,
no one got access to them.
I don't understand a word of that ?
When one process blocks interrupts, my teachers told me it applied to
the entire computer, all processes. Nothing, not even the kernel, can
intervene. The keyboard gets blocked, the clock interrupts get blocked.

It is impossible to block interrupts for minutes - and the entire
machine was responsive - except a single process.

I also do not understand how a user process can block interrupts; that
should be reserved for the kernel.
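
What a user process can mask is signals, not hardware interrupts, and even then not all of them. A small Python sketch of that difference, assuming a POSIX system where signal.pthread_sigmask is available; can_block is a name invented here:

```python
import signal


def can_block(sig):
    """Return True if this process can add `sig` to its blocked-signal mask."""
    # Ask the kernel to block the signal and remember the previous mask.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {sig})
    try:
        # Blocking an empty set changes nothing and returns the current mask.
        current = signal.pthread_sigmask(signal.SIG_BLOCK, set())
        return sig in current
    finally:
        # Always restore the mask that was in place before the check.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    for sig in (signal.SIGINT, signal.SIGKILL):
        print(sig.name, can_block(sig))
```

On Linux I would expect SIGINT to be blockable and SIGKILL not, since a request to block SIGKILL is silently ignored. None of this touches hardware interrupts, which a user process cannot disable at all.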
Post by Dave Howorth
You have one process which disabled interrupts whilst in some bit of
kernel code, maybe a driver, who knows. Disabling interrupts just
means a bit of code that must complete without any asynchronous calls
happening. Most probably to guarantee data integrity. It's perfectly
normal.
Right, but kernel code that suspends interrupts is not supposed to
persist indefinitely and should have been QAd by kernel devs, no?
Plus as Carlos says, since when has a network connection disappearing
been unexpected and have any effect on data integrity?
Right.
--
Cheers / Saludos,

Carlos E. R.
(from 15.1 x86_64 at Telcontar)