spin locks debugging

Hello

I have an old product to maintain, based on uClinux-2.4.20 on an MMU-less ARM processor. Periodically the system just reboots without any diagnostic messages. I suspect spinlocks as one of the sources of the problem, therefore I've enabled DEBUG_SPINLOCKS=2 in include/linux/spinlock.h.

Now the kernel reboots at boot-up time; investigation has shown it happens somewhere in 'rest_init()' in init/main.c. The code of the function is as follows:

static void rest_init(void)
{
	kernel_thread(init, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL);
	unlock_kernel();
	current->need_resched = 1;
	cpu_idle();
}

The first three statements are executed, and then the system reboots either inside 'cpu_idle' or after it. Am I doing something wrong? Isn't this a suitable mechanism for debugging?

PS. Setting DEBUG_SPINLOCKS=1 works fine, but doesn't provide full debugging capabilities.
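For context, here is roughly what the two debug levels do on a uniprocessor in 2.4's include/linux/spinlock.h (quoted from memory, so treat it as a sketch rather than the exact header). Level 1 merely tracks a lock word without checking it; level 2 grows the structure and adds a check-and-complain path that briefly disables interrupts around every lock operation:

/* DEBUG_SPINLOCKS == 1: a flag word, but no checking */
typedef struct {
	volatile unsigned long lock;
} spinlock_t;
#define spin_lock(x)	do { (x)->lock = 1; } while (0)
#define spin_unlock(x)	do { (x)->lock = 0; } while (0)

/* DEBUG_SPINLOCKS >= 2: bigger structure, plus check-and-babble */
typedef struct {
	volatile unsigned long lock;
	volatile unsigned int babble;	/* rate limit for the messages */
	const char *module;
} spinlock_t;

#define spin_lock(x) do { \
	unsigned long __spinflags; \
	save_flags(__spinflags); cli(); \
	if ((x)->lock && (x)->babble) { \
		printk("%s:%d: spin_lock(%s:%p) already locked\n", \
		       __BASE_FILE__, __LINE__, (x)->module, (x)); \
		(x)->babble--; \
	} \
	(x)->lock = 1; \
	restore_flags(__spinflags); \
} while (0)

Note that at level 2 both the size of spinlock_t and the behaviour of every lock operation change (save_flags()/cli() around each acquisition), which alters timing very early in boot.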

--
Mark
Reply to
Mark

Unless your kernel is compiled for multi-processor support, no spinlock-related code will be in it.
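For reference, this is roughly what an unmodified 2.4 uniprocessor build (DEBUG_SPINLOCKS unset or 0) turns the plain spinlock operations into in include/linux/spinlock.h; quoted from memory, so details may vary:

typedef struct { } spinlock_t;	/* empty struct: a GCC extension */
#define SPIN_LOCK_UNLOCKED	(spinlock_t) { }

#define spin_lock_init(lock)	do { } while (0)
#define spin_lock(lock)		(void)(lock)	/* evaluated only to avoid 'unused variable' */
#define spin_trylock(lock)	(1)
#define spin_unlock(lock)	do { } while (0)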

Nothing is ever executed after cpu_idle. This is the so-called 'idle thread', which is supposed to do nothing in a loop (literally) when no other kernel-scheduled entity (thread, kernel_thread, 'process') is runnable. It is not entirely inconceivable that your hardware has a problem with the architecture-specific idling code executed from cpu_idle.

Otherwise, the problem is possibly external. The usual cause (in my experience) of spontaneous reboots is an unstable power supply: if power is gone for a small amount of time, the system will boot again afterwards.

Reply to
Rainer Weikusat

So it is pointless to use 'spinlocks' on a uniprocessor system?

void cpu_idle(void)
{
	...
	while (1) {
		void (*idle)(void) = pm_idle;
		if (!idle)
			idle = arch_idle;
		...
	}
}

I have searched through the entire kernel tree under 'arch/armnommu' and have not found any platform-specific 'arch_idle' definitions, even for platforms already ported and in the main tree.

The problem is that the board reboots only when some specific protocol is activated (it's a network device), therefore I suspect the code rather than the hardware.

--
Mark
Reply to
Mark

Sorry, I was careless. The function is concealed in 'include/asm-armnommu/arch-myarch/system.h':

static inline void arch_idle(void)
{
	/* busy-wait until a reschedule is pending (or idling is disabled) */
	while (!current->need_resched && !hlt_counter);
}

--
Mark
Reply to
Mark

The macro 'spin_lock_irqsave(lock, flags)' boils down to:

do {
	local_irq_save(flags);	/* which on ARM expands to: */
	({
		unsigned long temp;
		__asm__ __volatile__(
		"mrs	%0, cpsr	@ save_flags_cli\n"
		"	orr	%1, %0, #128\n"	/* #128 sets the I bit, masking IRQs */
		"	msr	cpsr_c, %1"
		: "=r" (flags), "=r" (temp)
		:
		: "memory");
	});
	(void) (lock);
} while (0)

while spin_unlock_irqrestore() simply restores the flags.

So whenever I use spin locks on a uniprocessor system, all they do is disable/enable interrupts.

--
Mark
Reply to
Mark

Yes, but note that rest_init launches a thread, which calls the init() function to complete the initialization. The system may be dying in that thread. Put some printk's into that function.
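A minimal sketch of that kind of instrumentation; the call sequence inside init() varies between trees, so the calls shown here are only illustrative:

static int init(void *unused)
{
	printk(KERN_EMERG "init: entered\n");
	lock_kernel();
	do_basic_setup();
	printk(KERN_EMERG "init: do_basic_setup() done\n");
	...
}

KERN_EMERG makes the markers appear even with a restrictive console log level; the last marker that gets printed brackets where the system dies.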

Reply to
Kaz Kylheku

No, but they won't actually spin.

DS

Reply to
David Schwartz

"Mark" writes:

Only for the *_irq*-variants. The others just do nothing. The purpose of a spin lock is to provide mutex semantics for code which must not sleep (link itself onto a waitqueue and call the scheduler to cause a different task to be scheduled), that is, code running in so-called 'interrupt context', which is executed autonomously by the kernel in response to external events (interrupts), as opposed to code running in 'process context', which is executed by some process/thread which has made a system call (and may sleep); and for code whose execution must be serialized wrt other code running in interrupt context.

On a uniprocessor, it is sufficient to disable interrupts to achieve this mutual exclusion, because this guarantees that no other kernel code suddenly starts to be executed. On a multiprocessor, the possibility exists that some other CPU/core/hyperthread/$whatever executes conflicting kernel code. It would be possible to achieve mutual exclusion on a multiprocessor by disabling interrupts 'globally', meaning for all CPUs which exist in the system, but this is a really expensive operation because it basically halts everything 'just in case', and even 'just in an improbable case': it is usually desirable that lock contention is low, IOW, the chances that another $whatever will be executing conflicting code should be slim except in pathological cases.

That's where the 'spin' part comes into play: this refers to a busy-waiting loop which another processor will execute until the 'spin lock' is released by the code which presently holds it. This still affects all processors in the system, because of the atomic memory access operations necessary to implement the lock, and a spin lock more than one processor wants to acquire at the same time will cause the corresponding cache line to bounce back and forth among the processors which desire to own the lock (if there is only one processor waiting for it, this processor can happily spin along for as long as the cacheline belongs to it exclusively), but at least this happens only if there is actual contention for the lock.
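To illustrate the busy-waiting and the cacheline point, here is a toy userland spin lock in C11 (not kernel code; all names are made up). The inner read-only loop is what lets a waiter spin on a locally cached copy of the lock word instead of hammering the bus with atomic writes:

#include <stdatomic.h>

typedef struct {
	atomic_int locked;	/* 0 = free, 1 = held */
} toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *l)
{
	for (;;) {
		/* atomic test-and-set; the uncontended case succeeds at once */
		if (!atomic_exchange_explicit(&l->locked, 1,
					      memory_order_acquire))
			return;
		/* contended: spin on plain reads so the cacheline can stay
		 * shared instead of bouncing between the waiting CPUs */
		while (atomic_load_explicit(&l->locked, memory_order_relaxed))
			;
	}
}

static void toy_spin_unlock(toy_spinlock_t *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}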

For obvious reasons, an interrupt handler executing on the 'local' CPU cannot 'spin' until the code it interrupted has released a spin lock.

Reply to
Rainer Weikusat

That is the case on a non-preemptible kernel. But on a preemptible kernel, non-irq spinlocks have to do something: namely disable preemption.
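A sketch of what that means in practice. Kernel preemption only arrived with the 2.6-era CONFIG_PREEMPT (2.4 doesn't have it), and on a preemptible uniprocessor build the non-irq operations reduce to roughly:

#define spin_lock(lock) \
	do { preempt_disable(); (void)(lock); } while (0)

#define spin_unlock(lock) \
	do { (void)(lock); preempt_enable(); } while (0)

preempt_disable()/preempt_enable() just increment/decrement a per-task counter; the scheduler refuses to preempt while the counter is non-zero.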

This is completely wrong.

A spinlock is simply a fast mutual exclusion primitive. It is not a primitive that is dedicated to interrupts (but, obviously, the augmented interrupt spinlock extends spinlocks to interrupt context).

Processes that hold a spinlock must not sleep or be preempted for the simple reason that this would cause the waiting processes (which are spinning to acquire the lock) to /massively/ bleed CPU time, in a way only rivaled by Windows operating systems.

This has little to do with the reasons why interrupt context can't sleep. I.e. yes, interrupt context can hold a spinlock, and interrupt context cannot sleep. But this is not where the rule comes from that a processor can't sleep while holding a spinlock. Sleeping is forbidden even when holding a /non/-interrupt spinlock!

Process context may not sleep when holding a spinlock (irq or regular).

Non-irq spinlocks are used to efficiently serialize among processors (in a non-preemptive SMP kernel).

The irq part extends the usefulness of spinlocks to interrupt context; it basically combines two independent locks into one.

IRQ disabling provides exclusion between a processor and its interrupts, but not against other processors. A spinlock provides exclusion against other processors (efficiently, if combined with forbidden sleeping and disabled preemption), but not against being interrupted. So: they are combined together in the irq spinlock.
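Expressed as code (the my_ names are only illustrative; the real 2.4 macros are equivalent in effect):

#define my_spin_lock_irqsave(lock, flags) \
	do { \
		local_irq_save(flags);	/* excludes local interrupt handlers */ \
		spin_lock(lock);	/* excludes other processors (no-op on UP) */ \
	} while (0)

#define my_spin_unlock_irqrestore(lock, flags) \
	do { \
		spin_unlock(lock); \
		local_irq_restore(flags); \
	} while (0)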

No, it wouldn't. Disabling interrupts on other processors would not stop them from running non-interrupt-context code which could be a critical region. Doh?

The disabling interrupts model does not extend to processors, because they are truly concurrent, and interrupts are not. You cannot model the behavior of the other processors, running concurrently with this one, as if they were interrupts.

The interrupt cannot spin, because it isn't happening. There is no interrupt. If the local CPU holds an irq spin lock, then interrupts are disabled, remember?

An interrupt can, of course, spin on the lock---if another CPU has it.

Reply to
Kaz Kylheku

Strictly speaking, no. If kernel preemption is compiled into the kernel (it doesn't exist for 2.4), then, in addition to disabling interrupts, preemption also needs to be disabled to ensure exclusive execution. Since this is always necessary when using a spin lock, the code to do so was added to the corresponding routines. But this doesn't affect the OP, since his kernel doesn't support preemption (at least not if it wasn't added explicitly), and it still doesn't mean that actual 'spin lock operations' would be performed.

I didn't write that it was dedicated to interrupts. I wrote that its purpose would be to provide mutex semantics for code which must not sleep and hence, cannot use a semaphore or ordinary mutex. This is usually code running in interrupt context.

That's a pretty obvious consequence of the busy-waiting.

[...]

I didn't write anything about this since I (see above) considered it to be obvious. Code running in process context, such as different processes using the same driver, is allowed to sleep and hence can, and usually does, use locks where 'sleeping' is an option.

There is no such thing as 'an irq spinlock'. For convenience, spin lock locking and unlocking calls exist which also disable interrupts on the local CPU because this is necessary to 'lock out' code running in interrupt context.

Indeed.

[...]

You really seem to suffer from some kind of strange 'inverted cause and effect syndrome': As I wrote: Interrupts need to be disabled on the local CPU despite using a spin lock because otherwise, an interrupt handler could try to acquire the same spin lock and this would take a loooong time.
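The scenario, sketched as hypothetical 2.4-style driver code on an SMP machine:

static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;	/* made-up example lock */

/* process context, wrongly using the non-irq variant: */
	spin_lock(&my_lock);	/* interrupts are still enabled here... */
	/* <- the device interrupts on this CPU right now */
	spin_unlock(&my_lock);

/* interrupt context: */
static void my_irq_handler(int irq, void *dev_id, struct pt_regs *regs)
{
	spin_lock(&my_lock);	/* spins forever: the lock holder was
				   interrupted on this very CPU and cannot
				   run again until this handler returns */
	/* ... */
	spin_unlock(&my_lock);
}

With spin_lock_irqsave() in the process-context path, the interrupt is simply held off until the lock is released.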

Reply to
Rainer Weikusat
[...]

Before someone trips over this: Since interrupts must be disabled on the local CPU if locking out other interrupt handlers locally is necessary, this is, of course, meant to refer to 'mutual exclusion wrt code running on other CPUs, be it interrupt handlers or anything else'.

Reply to
Rainer Weikusat

If this results in a hitherto dormant piece of hardware becoming active (e.g., a network interface), it is completely possible that the power consumption suddenly increases. I had a very similar situation here with an 802.11 interface: whenever it was started, the board would reboot (so I thought) after a while. Fortunately, the defective power supply died completely a short time afterwards, and after replacing it, the problem disappeared.

Reply to
Rainer Weikusat
