Unusual experiences you have encountered while debugging?

- Simon Clubley

Fri, May 29, 2015 1:18 AM

[or the day I looked into the face of Hell. :-)]

And now for something a little different. :-)

What experiences have you had while debugging an embedded system that made you really wonder out loud what the hell was going on?

I've been building a homebrew programmer for the PIC32MX and have hit some major issues due to the really lousy Microchip programming specification.

At one point, just to see what happened, I decided to read virtual address 0x00000000 which the datasheet says is unmapped. This is what I got back:

[snip]

Identifying device attached to programmer
Device reports id = 04a00053
prog_read_block: base_address = 00000000, length = 16

0000: 48 65 6c 6c 48 65 6c 6c 48 65 6c 6c 48 65 6c 6c HellHellHellHell

After a minute or so I realised that a mundane set of circumstances (the PIC32MX mapping something into an allegedly unused address space, problems with the specification-supplied read memory sequence after the first longword, realising the burner part of my programmer was actually working, etc) had combined to create the above illusion.

However, for a minute or so I started to think Microchip had filled the unused address space with an "interesting" pattern. Oops. :-)
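For readers who want to see how such a dump can arise, here is a minimal sketch in C. It rests on two guesses that the post does not confirm: that the flash had been programmed with text beginning "Hello", and that the faulty read sequence kept returning the first longword. All names (buggy_read_block, format_dump_line) are hypothetical and are not part of the actual programmer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Guessed failure mode: the address never advances past the first
 * longword, so every 4-byte read returns the same data. */
void buggy_read_block(const uint8_t *flash, uint8_t *dest, size_t length)
{
    assert(length % 4 == 0);
    for (size_t offset = 0; offset < length; offset += 4) {
        memcpy(dest + offset, flash, 4);   /* a correct read uses flash + offset */
    }
}

/* Format one dump line: offset, hex bytes, then printable characters. */
void format_dump_line(unsigned offset, const uint8_t *data, size_t length,
                      char *out, size_t out_size)
{
    assert(offset <= 0xFFFFu && out_size >= 4 * length + 7);
    size_t used = (size_t)snprintf(out, out_size, "%04x:", offset);
    for (size_t i = 0; i < length; ++i) {
        used += (size_t)snprintf(out + used, out_size - used, " %02x",
                                 (unsigned)data[i]);
    }
    out[used++] = ' ';
    for (size_t i = 0; i < length; ++i) {
        out[used++] = isprint(data[i]) ? (char)data[i] : '.';
    }
    out[used] = '\0';
}
```

Reading 16 bytes this way from a buffer that starts with "Hello, world!" and formatting them at offset 0 gives a line matching the dump above.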

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world

- Don Y

Fri, May 29, 2015 3:53 AM

Most memorable was late 70's developing an 8085-based device (we didn't call it "embedded systems" back then).

We had exactly one prototype. Plastic case was built from milled pieces of lexan, bonded together and painted. Mechanisms were all "one off" hand made. Ditto electronics, etc. (I think we pilfered a power supply from one of our existing products).

Burned a set of EPROMs (Yippee! 2K byte devices! No more 1702's!!). Closed the lid -- carefully. Hit the power switch...

*Flash!*, *Bang!*

"WTF???"

Technician had placed a Black Cat with nichrome wire across the power supply for the "Bang!" -- and a flashbulb for the "Flash!".

He took great pleasure in commenting about how shook up I was!

Then, I drew his attention to the fact that the machine wasn't powering up: "Ooops!"

Suddenly, *he* was the one who was shook up! (How to explain to the boss that his practical joke had cost us THE prototype! :> )

When I worked on The Reading Machine, one of the basic tests we would do while bringing the system up was to push phonemes at the speech synthesizer to verify the data path was intact, synthesizer functional, amplifier, etc. These were all incredibly short bits of code because you had to "bit switch" them into core (minicomputer-based). So, you just had a crib sheet of octal codes that you'd quickly toggle into the machine, hit RUN and watch (listen) what happens.

The "stock" test just pushed a single phoneme code in at four different inflection levels. Sort of like: "ah, Ah, AH, *AH*" in an endless loop.

That, of course, is boring.

One day, we had the bankers coming in to appraise our assets (loan, I guess). A working machine (end product) is worth a helluvalot more than a bunch of components! So, big push to get all the "inventory" into a salable state!

Banker guy (?) wandered into our building to look things over. Machines all over the place (these are minis so they are pretty sizable... roughly as big as a dishwasher, etc.). Hallways, offices, lab, workshop, etc.

Then, the inevitable question: "Do these all work?" (obvious reason behind that!). Boss kinda cringed a bit and said "Yes". "Can I see one?" (beads of sweat on the boss's brow...) "Sure".

Boss of course had no idea what the *actual* state of each individual machine happened to be. We were freely swapping parts from machines to get as many units "up" as possible.

And, if he had steered the banker to a *particular* machine, that would have looked suspicious! (i.e., "Why can't you show me THIS machine, RIGHT HERE?")

So, he reached down and flipped the power switch. The core-resident code immediately started to run (none of this "boot delay" you see with modern machines). Chance had it that I had been working on that machine some time previous. And, the last test I had apparently performed was the synthesizer test. So, the machine immediately began pushing the phoneme codes: F UH2 K Y1 IU U1 F UH2 K Y1 IU U1 F ...

Of course, the banker had very little experience with synthetic speech (recall, this is late 70's) so he was "squinting" (why do people squint when trying to HEAR something??) trying to understand what this "noise" meant. Damn near everyone else in the building had no trouble sorting it out!! Boss was cherry red.

When he finally caught on, he laughed *so* hard... and forgot all about the fact that he had NOT seen the machine demonstrated as "operational"!

- Simon Clubley

Fri, May 29, 2015 4:15 PM

Hello Don,

If I had done that, I would have expected to have been fired. :-)

I'm young enough to have missed the machines which needed a full bootstrap routinely keyed into them, but old enough to have run across (as a student) machines with a full console front panel.

So yes, I understand the _strong_ desire to have kept this stuff short. :-)

BTW, I think it also makes you reflect that you have knowledge and experience of a way of doing things that today's newcomers will never experience - at least it does for me.

This even shows up in silly little ways; for example, I sometimes miss the ability to physically write protect a drive in certain situations.

I also suspect that my code is tighter as a result of growing up on more resource limited machines.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world

- John Devereux

Fri, May 29, 2015 4:58 PM

The classic (and safer) one was to have a long length of pneumatic tubing leading into the back of a rack. You blew cigarette smoke into the far end of the tube while your esteemed colleague was working on the rack...

The other funny one was when we finally got our controller prototype working. It had an 8748 microcontroller sequencing a pneumatic machine, motor etc. We set up a camera to take a picture of it running, the camera flash goes off and *bang*, the machine locks up.

We put the silver sticker over the EPROM window after that.

--

John Devereux

- Don Y

Fri, May 29, 2015 5:32 PM

For the most part, I've been fortunate to work with people who "didn't take themselves too seriously". This, IME, makes a huge difference in how "creative" people can get in their solutions... less worried about failing or "doing something that, in hindsight, was obviously pretty 'stupid'". OTOH, ripe for coming up with really *clever* approaches to problems that "less inspired" designs would stumble on. Not the sort of environment for folks with big egos.

The "normal" application was obviously too long to bit-switch in like this. A tiny bipolar ROM (I think 16x16 -- or maybe 32x16?) did the normal bootstrap... which loaded the image from a "data cassette" (the "Compact Cassette" format that was popular for music, at the time). Once loaded (into *core*), it was persistent, of course. So, subsequent power-ups just caused the code to start running immediately (cassette load was pretty slow).

The biggest take-away is learning to *think* about a problem before just flailing away at it: "Let me try this, recompile... nope, that wasn't it!" I think a lot of bugs creep into code because people only partially think through their proposed remedies -- it's too easy to just make a change, recompile and see the code (*appear*!) to work... then, move on as if that problem was solved. As if each problem was nothing more than a "typo".

At one point, I was working for a firm that had subcontracted some defense work from big blue. I was responsible for debugging the "processor" in the device.

We got a new device and their engineer came to help get the first machine up and running. A "Series 1" minicomputer was used to drive the test harness. The comms path (hardware) between the S1 and UUT was physically long (30 or 40 feet) and had to go through various gyrations to get to the proper logic levels, etc.

*LOTS* of one-shots (though they don't like calling them that!) to account for delays in various level translators, etc. This one triggers that one which, in turn, triggers this OTHER one, etc.

At one point, we couldn't get the two devices to communicate. I was convinced the problem was an insufficient delay in one stage of the "one-shot chain". Their engineer sat down, did the math and convinced himself that this was NOT the problem. So, dismissed the idea and went chasing other possible problems "on paper".

Never one to blindly "defer to my elders", I just walked off, grabbed a honking BIG capacitor that was lying on a nearby bench (without concern for its actual *rating*), held it across the timing capacitor for the one-shot that I suspected and, voila! Everything started working!

"What did you just do??"

I showed him the cap. His eyes went wide when he saw that it was like 1000 times larger than the circuit required...

"Well, that's way too big!"

"Yes, I know. But, obviously, the one that's *in* there isn't big enough! Now, we can sort out why that's the case! (wrong component installed? tolerances? some other issue that the design failed to take into consideration?)"

The current approach to much debugging seems to be "slap the big capacitor in the circuit and, if it works, LEAVE IT THERE!"

Possible with most of my SCSI drives (via a jumper). The issue is then whether the OS will gag when it encounters this restriction!

We used to code with the KNOWLEDGE/ASSURANCE that the executable would be installed in R/O memory. E.g., using 16rFF as a terminator because it could easily be checked (with an "increment the byte that this register is pointing at" opcode).

When we started building SRAM modules to *emulate* EPROMs, we had to include a "write protect" switch to ensure the SRAM behaved *like* an EPROM once the software image was installed. You quickly learned that failing to flip the switch caused your code to get clobbered really quickly! ("Hmmm... why are the data in all of these memory locations exactly +1 from what they *should* be?")
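A small simulation in C shows why the trick is safe in read-only memory and destructive in unprotected SRAM. The post does not name the CPU or the opcode (INR M on 8080-family parts is one real instruction of this kind), so memory_t, increment_in_place and find_terminator are hypothetical names used purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a memory device: writes only "stick" when the
 * device is writable (SRAM with the write-protect switch left off). */
typedef struct {
    uint8_t *cells;
    size_t   size;
    bool     writable;
} memory_t;

/* Models an "increment the byte this register points at" opcode:
 * read, add one, attempt the write-back, and report whether the
 * result wrapped to zero (the CPU's zero flag). */
static bool increment_in_place(memory_t *mem, size_t addr)
{
    assert(addr < mem->size);
    uint8_t result = (uint8_t)(mem->cells[addr] + 1u);
    if (mem->writable) {
        mem->cells[addr] = result;   /* an EPROM simply ignores this write */
    }
    return result == 0;              /* true only if the byte was 0xFF */
}

/* Returns the offset of the 0xFF terminator, or mem->size if none. */
static size_t find_terminator(memory_t *mem, size_t start)
{
    for (size_t addr = start; addr < mem->size; ++addr) {
        if (increment_in_place(mem, addr)) {
            return addr;
        }
    }
    return mem->size;
}
```

With writable set to false the scan finds the terminator and leaves the table untouched. With writable set to true it finds the same terminator, but every byte it visited is now one higher than it was, and the terminator itself has become 0x00.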

The "attitude" also extends to other aspects of design, beyond "software".

E.g., a medical device I designed many years ago had to maintain an internal database that would be served up via a pair of serial ports and a query language that I had designed. At the time, DRAM was small (16Kx1, 64Kx1) and EXPENSIVE! Stuffing 64K devices would add considerably to the cost. Yet, restricting the design to 16K devices could later require a redesign of the PCB and/or software.

My solution was to stuff 16K parts -- but, allow any or all of them to be replaced with 64K parts. And, the software treated the first 16K of that address space as "complete words"; but, all addresses beyond that were treated as "N-(possibly non-contiguous)-bit wide".

During POST, the system would clarify the types of memory devices present in each "bit position" -- in effect, creating a mask that indicated where the bits were valid in this "beyond 16K space". All accesses to the "data store" would occur through the accessors:

    result_t get_word(addr_t address, word_t &word)
    result_t put_word(addr_t address, word_t &word)

Of course, much slower than doing a memory cycle on a specific address! But, infinitely faster than the data rate that the query interface encountered (serial ports).
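A rough sketch of how such accessors might work, written in plain C (so the reference parameter becomes a pointer). Everything beyond the two function names is an assumption made for illustration: the 8-bit word, the simulated dram array, the packing of one logical word into several physical cells, and valid_mask standing in for whatever POST recorded. It is not the original design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BITS  8u       /* assumed word width: one x1 DRAM chip per bit */
#define BASE_WORDS 16384u   /* every bit position is populated below this */
#define PHYS_WORDS 65536u   /* address range of a fully stuffed board */

typedef uint8_t  word_t;
typedef uint32_t addr_t;
typedef enum { RESULT_OK, RESULT_OUT_OF_RANGE } result_t;
typedef struct { addr_t phys; unsigned cells; unsigned bits; word_t mask; } span_t;

static word_t dram[PHYS_WORDS];  /* simulated memory array */
static word_t valid_mask;        /* bit positions POST found usable beyond 16K */

static unsigned count_bits(word_t m)
{
    unsigned n = 0;
    for (; m != 0; m >>= 1) n += m & 1u;
    return n;
}

/* Spread the low bits of value into the bit positions set in mask. */
static word_t scatter(word_t value, word_t mask)
{
    word_t out = 0;
    for (unsigned bit = 0; bit < WORD_BITS; ++bit) {
        if (mask & (1u << bit)) {
            if (value & 1u) out |= (word_t)(1u << bit);
            value >>= 1;
        }
    }
    return out;
}

/* Inverse of scatter: collect the bits at the mask positions. */
static word_t gather(word_t raw, word_t mask)
{
    word_t out = 0;
    unsigned n = 0;
    for (unsigned bit = 0; bit < WORD_BITS; ++bit) {
        if (mask & (1u << bit)) {
            if (raw & (1u << bit)) out |= (word_t)(1u << n);
            ++n;
        }
    }
    return out;
}

/* Map a logical address onto physical cells and the usable bit mask. */
static result_t locate(addr_t address, span_t *s)
{
    if (address < BASE_WORDS) {
        *s = (span_t){ address, 1u, WORD_BITS, 0xFF };
        return RESULT_OK;
    }
    unsigned bits = count_bits(valid_mask);
    if (bits == 0) return RESULT_OUT_OF_RANGE;
    unsigned cells = (WORD_BITS + bits - 1u) / bits;   /* cells per logical word */
    addr_t index = address - BASE_WORDS;
    if (index >= (PHYS_WORDS - BASE_WORDS) / cells) return RESULT_OUT_OF_RANGE;
    *s = (span_t){ BASE_WORDS + index * cells, cells, bits, valid_mask };
    return RESULT_OK;
}

result_t put_word(addr_t address, const word_t *word)
{
    span_t s;
    assert(word != NULL);
    if (locate(address, &s) != RESULT_OK) return RESULT_OUT_OF_RANGE;
    unsigned rest = *word;
    for (unsigned i = 0; i < s.cells; ++i) {
        dram[s.phys + i] = scatter((word_t)rest, s.mask);
        rest >>= s.bits;
    }
    return RESULT_OK;
}

result_t get_word(addr_t address, word_t *word)
{
    span_t s;
    assert(word != NULL);
    if (locate(address, &s) != RESULT_OK) return RESULT_OUT_OF_RANGE;
    unsigned value = 0;
    for (unsigned i = 0; i < s.cells; ++i) {
        value |= (unsigned)gather(dram[s.phys + i], s.mask) << (i * s.bits);
    }
    *word = (word_t)value;
    return RESULT_OK;
}
```

With valid_mask set to 0xA5 (four usable bit positions), each logical word beyond 16K occupies two physical cells, and callers never see the difference apart from the smaller address range.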

OTOH, I can recall another early design where the hardware guy had opted to save the cost of a shift register -- forcing the software to do shift-store cycles in a tight loop. It was *embarrassing* to see how much that "savings" COST the design!

[hardware and software folks tend not to overlap, IME]

The problem with this sort of mindset is that it is REALLY hard to shake! I recall designing an interface to a PROM programmer and found myself INSTINCTIVELY writing (in C) things like:

    put_nybble(...) {put value & 0x0F}
    put_byte(...) {put_nybble; put_nybble}
    put_word(...) {put_byte; put_byte}
    put_long(...) {put_word; put_word}

without thinking about whether this was *necessary* or *clear*!
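Fleshed out as compilable C, the layering might look like the sketch below. The ASCII-hex output, the most-significant-first ordering and the out_buf sink that stands in for the link to the programmer are all assumptions; the post only gives the outline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sink standing in for the link to the PROM programmer. */
static char   out_buf[64];
static size_t out_len;

static void put_char(char c)
{
    assert(out_len < sizeof out_buf - 1);
    out_buf[out_len++] = c;
    out_buf[out_len] = '\0';
}

/* Lowest layer: emit one hex digit for the low four bits. */
static void put_nybble(uint8_t value)
{
    put_char("0123456789ABCDEF"[value & 0x0F]);
}

/* Each wider layer is just two calls to the layer below, high half first. */
static void put_byte(uint8_t value)
{
    put_nybble((uint8_t)(value >> 4));
    put_nybble(value);
}

static void put_word(uint16_t value)
{
    put_byte((uint8_t)(value >> 8));
    put_byte((uint8_t)(value & 0xFF));
}

static void put_long(uint32_t value)
{
    put_word((uint16_t)(value >> 16));
    put_word((uint16_t)(value & 0xFFFF));
}
```

Only put_nybble knows how a digit is produced, so the conversion table exists once and each wider routine costs a couple of calls.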

I'm now working in a resource rich (more like "resource gluttony!") environment and it is REALLY hard to discipline myself not to micro-manage aspects of the design: "Burn a few million cycles, who cares! Use them to improve reliability and ease maintenance efforts!"

I think going from scarce to plentiful is considerably easier than trying to do it the other way around. I suspect most folks who write code for desktops haven't a clue as to maximum stack penetration, etc. They just tweak things until they *appear* to work -- and hope that they have encountered (purely by CHANCE!) the worst case scenario at some point "on the bench" (instead of designing *for* it!). So, "flukes" just get shrugged off -- instead of explored in detail: "That SHOULDN'T happen! So, why *did* it? (you saw it, too, didn't you??)"
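As an illustration of designing for the worst case instead of waiting to stumble on it, maximum stack penetration can be bounded by adding up frame sizes along the deepest path of the call graph. The sketch below is a toy version in C: the function table, the frame sizes and the names are invented for the example, recursion is assumed absent, and interrupt handlers would need their own allowance on top.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4
#define NO_CALLEE   (-1)

typedef struct {
    const char *name;
    unsigned    frame_bytes;            /* stack used by this function itself */
    int         callees[MAX_CALLEES];   /* indices into the table, or NO_CALLEE */
} func_t;

/* Invented example program: main calls parse and report, and so on. */
static const func_t example_table[] = {
    { "main",      32, { 1, 2, NO_CALLEE, NO_CALLEE } },
    { "parse",     48, { 3, NO_CALLEE, NO_CALLEE, NO_CALLEE } },
    { "report",    16, { 3, 4, NO_CALLEE, NO_CALLEE } },
    { "read_byte",  8, { NO_CALLEE, NO_CALLEE, NO_CALLEE, NO_CALLEE } },
    { "format",    64, { NO_CALLEE, NO_CALLEE, NO_CALLEE, NO_CALLEE } },
};

/* Worst-case stack depth, in bytes, for a call that enters at index.
 * Assumes the call graph has no cycles (no recursion). */
unsigned worst_case_stack(const func_t *table, int index)
{
    assert(table != NULL && index >= 0);
    unsigned deepest = 0;
    for (int i = 0; i < MAX_CALLEES; ++i) {
        int callee = table[index].callees[i];
        if (callee == NO_CALLEE) continue;
        unsigned depth = worst_case_stack(table, callee);
        if (depth > deepest) deepest = depth;
    }
    return table[index].frame_bytes + deepest;
}
```

For the example table the deepest path is main, report, format: 32 + 16 + 64 bytes. That path may never be exercised in a casual bench test, which is the point of computing it.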

- Reinhardt Behm

Fri, May 29, 2015 5:49 PM

I did a lot of control systems with pneumatics and often tuned the timing to have nice rhythm. :-)

Reminds me of when I had a CPU board (in the old Z80 days) that did not work correctly. Took it, disconnected, to my desk and tried to measure some connections with an ohm-meter, and got weird results of the kind you get when doing this with power applied. Switched to voltage measurement and measured something like 0.5..1V on the unpowered board. Put my head closer and the voltage was gone. 1N4180 with glass cases make nice photo elements :-)

--
Reinhardt

- Tom Gardner

Fri, May 29, 2015 6:13 PM

You can still do that with the RPi, as noted just about everywhere.

- Simon Clubley

Fri, May 29, 2015 9:29 PM

It was the destroying the only prototype because you were fooling around that I was reacting to.

If you destroy the only prototype because you tried something work related after thinking it through and it didn't work, then that's more "don't _ever_ do that again!" territory. :-)

Ok Don, you are now making me feel old - I've also done the load programs from cassette routine. :-)

Yes, but I want a set of nice buttons on the front panel (one for each drive) marked "Write Protect". :-)

BTW, at least for Linux, I sometimes mount partitions read-only when doing some things and it works ok, so I would assume most of the support is already there.

I'm certainly stronger with software than I am with hardware.

I've done a similar thing with my portable UART/formatted output library I wrote for bare metal programming (ranging from small 8-bit MCUs to 32-bit devices).

I didn't want any redundant code in the library or final executable so I structured it in the same way as above (making more complex functions out of lower level functions) and made sure each function was in its own source/object module so it wouldn't get pulled in unless actually needed.

And yes, the above design pattern looks clear to me.

Based on some things I have read, and which I have experienced to some degree, I would strongly agree with this.

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world

- Don Y

Fri, May 29, 2015 10:03 PM

Shit Happens. There's nothing the boss could do (e.g., fire him) that would be more punishment than his having to face us each day thereafter.

The parts that are hard to reproduce are the mechanisms, case, etc. Software can be recreated just by burning new ROMs. Power supply is just pull another assembly from manufacturing. So, the only real stuff at risk is the "processor board" -- relatively easy to wirewrap a new one, etc.

I was working on a prototype for a KWHr meter (the gizmo you have on the outside of your home/business that measures "power" consumption for billing purposes with the public utility). Slipped with a scope probe and watched the prototype burst into flames. (sigh)

As I said, it's better to learn not to take oneself too seriously. I've seen shops where folks frowned on any sort of "antics". The sorts of places where the place buttons itself up TIGHT at 5:00:00.00 on the dot -- "it's just a job". It's much more pleasant working at the places where folks feel more comfortable around each other.

[I once fell asleep with a 'scope probe in my hand (we were putting in VERY long hours). The guy I was working with just took it from my hand and kept working. And gave me a wicked grin when I finally "came to": "Have a nice nap?"]

From *data* cassettes? Or, from *audio* cassettes (Kansas City Standard)?

I've seen some wonky little "tape drives" from my Brit friends that leave me wondering about whether someone went a bit too far trying to "economize" on the hardware...

I had a tall tower many years ago that I did this with. I put toggle switches on the back panel that I could just flip on/off for each of the four drives within.

Yes -- and in single-user mode it's the de facto condition. But, I've never tried to see what would happen if the mount(8) command that I typed (or implicitly relied on via [v]fstab(5)!) wasn't consistent with the position of the toggle switch.

It's other OS's that get touchy (e.g., Windows).

I find there are a lot of assumptions in most desktop OS's that, when violated, cause all sorts of grief!

(E.g., yesterday, I had an optical drive fail. Windows simply refused to shut down -- no doubt hung in a driver call that it didn't have the ability to abort!)

My software looks like hardware mechanisms and my hardware often looks like *software*. E.g., I've never had any problem dealing with multitasking, true parallelism, etc. -- because I can imagine bits of hardware that operate concurrently without the need for serialization.

Often, this duality is an effective way of explaining implementations (hardware or software) to folks with "the other" (software or hardware) background.

What I don't understand is, you KNOW it "can't happen"; you know it *DID* happen ("You saw it too, didn't you?"). Doesn't this cause you to question what else you've misunderstood? It's not like some *user* is reporting some observation that you can dismiss on the assumption that users are notoriously bad at accurately reporting detail. Don't you WANT to know what was wrong with your ASSUMPTION?

Party this evening. I should probably go make myself presentable...

(sigh) Waste of time.

- Simon Clubley

Sat, May 30, 2015 3:40 PM

Hello Don,

Good point. :-) The Compact Cassettes in question were sold as audio cassettes.

Well, we are a nation of inventors. :-)

Sometimes inventing stuff means you look back afterwards and think either "that was a really good idea" or (sometimes) "that wasn't my greatest ever idea"... :-)

Simon.

--
Simon Clubley, clubley@remove_me.eisner.decus.org-Earth.UFP 
Microsoft: Bringing you 1980s technology to a 21st century world

- Don Y

Sat, May 30, 2015 4:20 PM

I suspected. These were special data cassettes. IIRC, a clock track was prerecorded on the media. The tapes would go bad frequently.

You often learn more from mistakes than successes!

- rickman

Sat, May 30, 2015 8:00 PM

I worked on a design once to pull data from a tape like this. It had a clock track which was solely to control the tape movement and the data was Manchester encoded. The company I worked for "inherited" the design from someone (perhaps a government facility) who wanted it to be sold commercially. The design wasn't bad, but they forgot a few things that you just don't do in production like leave TTL inputs floating. It would have very intermittent errors from a FF being reset randomly (open reset input). That took a while to figure out.
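For anyone who has not met Manchester encoding: every data bit becomes a pair of half-bit levels with a transition in the middle, so the data carries its own timing, and a pair with no transition can be flagged as an error. The sketch below is in C and assumes the IEEE 802.3 polarity (1 is low then high, 0 is high then low) with bytes sent MSB first. The post does not say which polarity or framing the tape format used, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one byte, MSB first, into 16 half-bit levels.
 * Assumed convention: 1 -> low,high (binary 01); 0 -> high,low (binary 10). */
uint16_t manchester_encode(uint8_t byte)
{
    uint16_t out = 0;
    for (int bit = 7; bit >= 0; --bit) {
        out = (uint16_t)(out << 2);
        out |= ((byte >> bit) & 1u) ? 0x1u : 0x2u;
    }
    return out;
}

/* Decode 16 half-bit levels back into a byte.
 * Returns false if any pair lacks the mid-bit transition (00 or 11). */
bool manchester_decode(uint16_t halves, uint8_t *byte)
{
    assert(byte != NULL);
    uint8_t value = 0;
    for (int pair = 7; pair >= 0; --pair) {
        unsigned symbol = (halves >> (2 * pair)) & 0x3u;
        if (symbol == 0x1u) {
            value = (uint8_t)((value << 1) | 1u);
        } else if (symbol == 0x2u) {
            value = (uint8_t)(value << 1);
        } else {
            return false;   /* invalid symbol: report it, do not guess */
        }
    }
    *byte = value;
    return true;
}
```

A real tape reader also has to recover the half-bit timing from the signal, which this sketch skips by starting from already-sampled levels.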

I remember learning about CMOS being sensitive to static discharge. I don't recall how I picked up a charge, but I zapped a board of mixed TTL and CMOS. When I debugged it I found that nearly every CMOS chip was zapped while the TTL was all good. This was *long* before anyone was wearing wrist straps for static control.

--

Rick