Code metrics


Any pointers to research that can shed light on the robustness/vulnerability of *particular* code metrics in the context of modeling:

- effort required

- "code correctness" (non-bugginess)

And, in practice (for shops regularly *using* metrics in their development/test models), whether the type of metric employed has a measurable impact on the coding styles of developers (consciously or subconsciously). I.e., do they try to "game" the system?



Reply to
Don Y

I don't know if things have changed in the last decade or so, but the last time I really paid attention to this, it was felt that _any_ code metric could be gamed. Search on "Dilbert" and "write me a new mini-van".

There's one that measured overall code complexity that my 2nd favorite manager of all time really put a lot of reliance on -- and he was my absolute favorite manager when it came to software issues and developing his people. I absolutely can't remember the name of it, though.

Reply to
Tim Wescott

But, what do they *gain* by doing so? I.e., hindsight indicates how "correct" their code was along with how long it took to write. All their efforts do is obfuscate any data that *might* be used to help make these predictions ahead of time.

I.e., if you actively *track* evolving code metrics along with completeness and correctness, gaming the numbers just lets someone *later* say, effectively, developer X's metrics are worth LESS (predictively) than developer Y's.
Reply to
Don Y


Both might be correct, one might be better or faster or easier to update.

Some years ago, I was working on a program written by someone else (who may or may not have had any code metrics) that had a long series of IF statements, each with one assignment, where I would have written a loop and a look-up table. The loop and look-up table might be three lines; the series of IF statements was 16, but could have been much more, or a little less.

If someone is paid by lines of code produced, or had a productivity measurement done by lines of code per day, they might have incentive to write the IF statements.
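As a sketch of the contrast (hypothetical code, not the original program): the IF-chain grows by one line per case, while the table-driven version's logic stays at a few lines no matter how many cases are added.

```c
#include <stddef.h>

/* Style 1: one IF (and one line) per case -- LOC grows with the data. */
int lookup_if_chain(int code)
{
    int value = 0;
    if (code == 0) value = 10;
    if (code == 1) value = 20;
    if (code == 2) value = 30;
    if (code == 3) value = 40;
    /* ...one more line for every additional case... */
    return value;
}

/* Style 2: a look-up table and a loop -- the logic is three lines
   regardless of how many entries the table holds. */
int lookup_table(int code)
{
    static const struct { int code, value; } table[] = {
        { 0, 10 }, { 1, 20 }, { 2, 30 }, { 3, 40 },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].code == code)
            return table[i].value;
    return 0;    /* no match */
}
```

Under a lines-of-code metric, style 1 "produces" four times the code here, and the ratio only gets worse as cases are added.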

You could, I suppose, use software to measure code complexity, but for a large project it is likely too hard to compare. By the time it comes to updating, or otherwise keeping old code running, it will be long forgotten who wrote it and why.

There are many stories in "The Mythical Man-Month" about OS/360, some related to byte targets. When RAM (magnetic core) cost on the order of $1/byte, keeping things small was pretty important. But without the appropriate metric, the result was moving things to places where they weren't counted. Many important OS control blocks are in user space, where other OSes might have kept them in system space.

I suspect it is about as hard to do the metric right as it is to write the software in the first place.

-- glen

Reply to
glen herrmannsfeldt

Yes, but are folks *really* paid/incentivized that way? As a regular employee, it was always about the *job* -- no one evaluated your daily effort. As a contractor, it was either "per job" or "per hour" payment. Again, no one knew if you wrote 5 lines of code for that job/hour or 5,000.


Yes. My point was there's no reason to "game" the numbers ahead of time (i.e., prior to completion when "all the results are in").

[This assumes there is no other incentive, see above]

Sure! Ages ago, a network connection was priced based on the class of machine sitting on the wire. So, put a dinky PC there and hide your VAXen behind it! :>

I'm not sure you have to get it "right". E.g., after the fact, you can (try to) correlate correctness, effort, etc. to data that you collected *during* the process.

E.g., we each have our own schemes for estimating effort a priori. And, some of us track "defects" during the process to get a feel for "how close to done" we are.

The question boils down to the value of having some measure that can be applied to evaluating projects before hand as well as along the way. Instead of just "winging it".

*But*, if folks are going to change their behaviors midstream, this would tend to invalidate those observations: "I've been losing weight at a rate of 1 pound per week. But, I changed my diet, yesterday..." :<
Reply to
Don Y

It has happened. In the early 80s I did some work in a shop where the new programming manager instituted LOCs as a productivity metric, which then factored into raises and bonuses.

We were contractors, so I was often able to ignore some of the politics. But when some of the less bright bulbs working for the company started using their suddenly massively improved "productivity" to take pot-shots at *us*, and we were officially notified that our performance was deficient on that basis, it took only a few printouts of before and after versions of programs* from the VCS to cause *quite* the ruckus.

Needless to say I was not popular with the aforementioned dim bulbs, or the manager in question, but most of the programmers loved me for it (they were obviously being screwed by these guys too), and corporate used us for years for straight answers.

*This may be the only time I ever thought that the English-like readability** of Cobol was actually an asset. When the "after" version had the same exact Cobol sentence spread over more lines, it was pretty obvious what was happening, even if you had no clue about programming.

**Such as it is.

Reply to
Robert Wessel

Writing fast and sloppy has been glorified and institutionalized through the use of "agile" methods, rapid releases, push updating, and using the customer as unpaid testers.

"Hey Joe! Customer had a problem doing ______" "No problem! I'll push a fix in the morning update."


Weinberg's Second Law: If builders built buildings the way programmers 
wrote programs, then the first woodpecker that came along would 
destroy civilization.
Reply to
George Neuner

Does it *still* happen? Haven't people learned anything in the 30 years since? Code reviews?? (or, do you bribe your peers?? :> )

But that's just an example of a PHB who doesn't understand the technology he's managing! It wouldn't take much effort to talk him/her into a corner: "So, all that matters is how MUCH code I write, correct? Whether it works or not isn't a factor? Likewise, efficiency?" followed by a trivial example:

    ASSERT(multiplier > 0 && multiplicand > 0);
    product = 0;
    for (i = multiplier; i > 0; i--) {
        for (j = multiplicand; j > 0; j--) {
            product++;
        }
    }

Was this a result of them manipulating their programming styles (and, thus, metrics)? Or, were they just sloppier coders to begin with?

I.e., this leads to one of several outcomes -- most of which are bad for the organization (and, immediately, the "manager" involved):

- fire your less productive people and let the "stars" do it all (i.e., shittier developers)

- let those with the lower (better) metrics learn how to inflate their metrics (i.e., shittier code)

- rearrange the task allocation so the "stars" can take on the tougher responsibilities -- as they are obviously more capable of doing so! (i.e., shittier result -- time, money, quality, etc.)

It's just a typical short-term anomaly that comes back to bite folks in the end. E.g., the "stars" are now stuck ALWAYS writing shitty (and shittier!) code lest their metrics start to drop. Even an idiot soon realizes that he's got to "produce" come the end of the day. "You've written 27MB of sources and the product still just sits there 'initializing memory'..."

I don't think it makes sense to compare metrics between developers. Even assuming "honest" folks, there are just too many variables in style and problem/development domain. Two functionally equivalent approaches to a problem could have significantly different metrics.

OTOH, I would think individual developers would benefit from knowing how their coding style, etc. pans out *quantitatively*: "Oh, I don't keep score, Judge" "But how do you measure yourself with other golfers?" "By height."

People are notoriously bad at remembering how much previous efforts "cost": "It was a three month effort..." "Yeah, but were you working on it EXCLUSIVELY for those three months?" And, few places seem to actively quantify bug detection and removal rates: "We finished in just over 6 months..." "Yeah, but you were still encountering a bug-per-day at that time. How do you consider that 'finished'? Just because the boss pulled the plug at that point??"

Do you know if a refactor *really* gained you anything -- in terms of performance, correctness, complexity reduction, etc.? Do you know if a change in your coding style produces *measurable* improvement? (i.e., should you even bother recommending it to others?)

Reply to
Don Y

But the numbers you'd gather from *that* effort would (roughly) translate to a similar effort undertaken in that same style.

And, would give you a way of comparing a *different* development style to that one with measurable results: "Yeah, we got the code to the user a lot quicker, the old way. But, it ended up more complex and more costly and we had a customer grumbling all the time we were issuing those endless updates! -- 'when is this thing going to be *done*?'"

Reply to
Don Y

There is "The Impact of Fault Models on Software Robustness".

"Choosing Error Models for OS Robustness Evaluations" by Stefan Winter.

"Exception Handling" by Charles P. Shelton (note that this is a paper by one of Phil Koopman's students).

I hope they are useful for what you need.

I rather try and remove problems in the software by getting the requirements specification de-bugged first. I find that specs that result in Clear, Concise, Correct, Complete, Coherent and Confirmable statements of requirements will lead to many fewer problems in the code. This requires that even the cyclomatic complexity of the requirements specifications should be minimised.

I gather metrics, through the review process, for all problems found in specifications and designs and, like any well managed project, solving the problems early in the lifecycle has a massive benefit. Which is why projects need a good deal of "front-loading" in order to produce a good plan. It seems to me that Systems Engineering and Project management have a lot in common in that respect.

Paul E. Bennett IEng MIET..... 
Reply to
Paul E Bennett

Would that be the McCabe Cyclomatic Complexity measure?

Paul E. Bennett IEng MIET..... 
Reply to
Paul E Bennett


I often think a better metric for bonus payments is Function Points correctly implemented (passed through test without detected errors).

Paul E. Bennett IEng MIET..... 
Reply to
Paul E Bennett

"Get the metric guys to shut up and let me do my job."

This is what I'm experiencing. If the metric-checking tool says "this function containing 10000 assignments for a sparse matrix is too long", the function is split into 10 functions of 1000 assignments each (instead of considering a table and a loop). If the checking tool says "you need a cast here", a cast is added (instead of examining whether there could be actual truncation happening). If the metric says "you cannot have more than X tickets", changes are implemented without tracking tickets. If the metric says "you cannot release a change without a ticket", huge refactorings are billed to trivial tickets.

(That's the observation of about half a year of code-review.)
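For the sparse-matrix case, the table-and-loop alternative might look like this (an illustrative sketch, not the actual code under review):

```c
#include <stddef.h>

/* Instead of thousands of literal assignments
 * (m[3][7] = 1.5; m[12][0] = 2.0; ...), keep the nonzero entries
 * in one table of (row, col, value) triples and loop over it. */
struct entry { int row, col; double val; };

static const struct entry nonzero[] = {
    {  3, 7, 1.5 },
    { 12, 0, 2.0 },
    /* ...one data line per nonzero element... */
};

static double m[16][16];    /* all other elements remain 0.0 */

void fill_matrix(void)
{
    for (size_t i = 0; i < sizeof nonzero / sizeof nonzero[0]; i++)
        m[nonzero[i].row][nonzero[i].col] = nonzero[i].val;
}
```

The function-length metric now sees three lines of logic; the bulk of the file is data, which no length checker objects to -- and which is far easier to audit than 10000 assignments.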


Reply to
Stefan Reuther

One would hope not, but given the level of stupidity in the world I'm sure someone out there is still counting LOCs...

Some were just adjusting their writing styles to game the system. Worse offenders, and there were several, were literally taking existing code and spreading each statement across several lines.

And worse. One fellow had actually reduced(?) some sections of a program to a single token per line. Seriously.

There was an automated procedure in place to count the increase in LOCs at checkins. There were actually cases of negative productivity, which should have been a huge red flag, but apparently weren't.

You forgot demoralizing the actual professionals.

And it did come back to bite a number of people in their backsides. I just happened to help* precipitate the boomerang. Think about the immediate effects on your career when you're caught doing extra work just to game the rating system.

*FSVO "help". It probably would have happened sooner or later, but I was the one who stomped into the CEO's office with our reprimand letter, the department productivity reports and a stack of program listings... Ahhh, youth... I have perhaps gotten a bit more tactful with time.

Without doubt.

Still, there are painfully obvious, and very large, differences in skill levels and productivity between developers. A desire to measure that in some way is not unreasonable. That it's actually really, really hard to do is a different issue.

Of course we don't. With a bunch of experience, many of us like to think we have a handle on that, but of course we don't, at least not in any strict sense. Sticking to some basic principles is often a good thing, although not blindly. Still, we usually can look at a bit of code and say "that's a mess" or "eh" or "nice" or "my eyes! my eyes!" pretty quickly.

Reply to
Robert Wessel

Metrics (of all sorts) should only be used in an advisory capacity -- to give you a way of *measuring* the code against *some* yardstick. They only make sense from a developer's point of view if you have two teams developing the same product in isolation and want to explore consequences of the differences in their abilities, approaches, etc.

[Outside an academic environment, I can't see that happening!] *Or*, if you want to compare project A to project B with the *same* crew (and technique).

E.g., defect counts from different projects tell you -- what?? How do you weigh the "value" of a particular defect? Project A has only *one* defect -- it crashes as soon as main() is invoked! :> Project B, OTOH has hundreds of (little) defects. Which does the metric indicate is "better"?

Metrics also don't make sense on small projects -- it's too easy for a developer to manipulate his style FOR THE SAKE OF A METRIC. Keeping up that behavior consistently over a longer period of time and a more diverse body of code quickly takes too much effort (and makes you too visible in your lack of *real* results).

And this wasn't obvious to the manager? Was he, perhaps, a used-car salesman by trade??

Choice of metric can also impact how much leeway you have for gaming it. E.g., "count the semicolons" vs. "count the newlines"; count blanks separately. Count commentary separately.

For example, I tend to use lots of newlines and trailing comments:

    static const char *
    get_message(
        subsystem_t subsystem,    /* subsystem id for error context */
        error_t error_code,       /* error code of interest or "NULL" */
        ...                       /* varargs related to error code */
    ) {
        ASSERT(subsystem != NONE);    /* contractual requirement */
        ...
    }

I wouldn't expect this to "count" for anything more than:

static const char *get_message(subsystem_t s, error_t error_code, ...) { }

However, metrics that bring out the differences between the two styles (comment count, newline count, whitespace count, etc.) *can* be helpful in identifying (statistically) differences between the "quality" of the code as well as the effort to create it.
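A toy illustration of the point (my own sketch, not any real metrics tool): two naive counters applied to the same source text give the same "statement" score but very different "line" scores, which is exactly the gap a gameable metric lives in.

```c
/* Two naive "size" metrics applied to the same source text. */
static int count_char(const char *src, char c)
{
    int n = 0;
    for (; *src != '\0'; src++)
        if (*src == c)
            n++;
    return n;
}

/* "count the semicolons" -- a rough statement count */
int loc_semicolons(const char *src) { return count_char(src, ';'); }

/* "count the newlines" -- a physical-line count */
int loc_newlines(const char *src) { return count_char(src, '\n'); }
```

Feed both counters the dense and the airy spellings of the same function and the semicolon counts match while the newline counts differ severalfold.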

You want to be able to ask questions (of yourself, if no one else) like:

- does tagging each argument with a comment lead to less buggy invocations?

- does adding all these comments impact the time to write the code?

- does this style lead to code being revisited/reworked more often? etc.

I'm setting up infrastructure to count and log metrics during development. Together with logging bugs and "rework effort", I am hoping (the right metrics) can be correlated with the total effort, correctness, etc.

- what are the impacts of *no* comments?

- at what point do additional comments *reduce* quality, correctness? etc.

Regardless, it doesn't turn out as a "plus" for the organization. So, if your goal is NOT to "improve" (something, anything!) in the organization, why not turn off the lights in the offices and replace the computers with pens/pads and outsource all the "typing"? (I worked at a place where developers were encouraged to let a data entry person "do all their typing") After all, you don't *care* about results, right?? :-/

The problem was the *application* of the metrics.

E.g., when I write technical documentation, I routinely use a tool that reads my text and tries to predict a "reading (grade) level" for that text. My goal isn't to get that number as high as possible -- as that makes comprehension suffer.

Nor is it to drive that number down as *low* as possible (the so-called "reading at the 6-th grade level" mantra).

Rather, it is a way of measuring myself against my own expectations for the piece: should this be *easy* to read? or not? Who's my audience? What's the subject? What should I *expect* from them?

Hence my suggestion that you *not* do it!

E.g., I design severely resource constrained, real-time systems with insane uptime requirements. How can you compare my style/effort/results with someone who writes code that is updated *daily* via a broadband connection (while mine requires in-depot updates or field replacement)? Or, with code that can be routinely "rebooted" (e.g., to address memory leaks)?

Sure! I had a colleague once call me an "artist" in the way I approached my projects (hardware and software). Another once commented that "It took me a while to figure out why you were doing what you did -- but, once I did, everything became glaringly obvious!"

But, Manglement would still like to be able to put numbers on what we do. And, *I* would like to be able to see how my "performance" is changing with time/experience. As well as have tools to tell me if I made the right decision (e.g., refactoring something).

It's the same argument as measuring *before* optimizing: unless you can put *some* numbers on "it", there is no way to understand even what *direction* "it" is moving!

The fact that the metrics may be (somewhat) "bogus" is immaterial. As long as they can reasonably be compared (to each other).

I worked at a firm that made hand tools. How do you evaluate your "process", there? Is *this* hammer "better" than *that* one (after we replaced the wooden handle with a fiberglass one)? Is this tape rule better than that (after we replaced the coating on the "tape" with a different formulation)? How do each of these compare to our *competitors'* products?? [i.e., there are no user-visible "specifications" for hand tools. Ever wonder how accurate that tape rule *really* is? In *engineering* terms?? Or, do you just *expect* it to be "accurate enough"? Why do the tips shear off on these Phillips screwdrivers? Or, bend/twist on these slotted ones? Am I "too strong"? Or, are they "crap"??]
Reply to
Don Y

But that's (IMO) a misapplication of metrics. They're trying to use it to control/impose quality, process, etc.

I believe virtually all "rules" are mistakes when it comes to software development. They should be considered "guidelines". Making things rules suggests you have incompetent folks doing the work (and bean counters trying to constrain it). Expect people to know their craft.

No dynamic memory allocation? Why?

SESE? Why?

Verify divisor != 0 prior to use? Why?

Maintain stack protocol? Why?

Tools should provide *guidance*; the developer should evaluate that guidance in the context of the problem at hand.

E.g., I target "no warnings" in my compiles. But, that's because there isn't a language feature that allows me to insert:

    #acknowledge The 'missing cast' warning can be ignored, here

to indicate that I am aware of the warning and have evaluated it properly. *AND*, the compiler's failure to signal that warning at this directive should, itself, signal an ERROR (i.e., this acknowledgement shouldn't be here). It would be a type of "comment" that the compiler *could* indicate was "incorrect"!

[Instead, I fix each warning and just expect each version of the code to throw *no* warnings -- as that can easily be verified by another developer in the absence of an "acknowledge" directive]
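The nearest existing mechanism I'm aware of is GCC/Clang's diagnostic pragmas -- though, unlike the proposed #acknowledge, the compiler stays silent rather than erroring out if the acknowledged warning ever stops applying. A sketch:

```c
#include <stdint.h>

uint16_t narrow(uint32_t wide)
{
    /* Reviewed: truncation to 16 bits is intended here.
     * The pragma pair suppresses -Wconversion for this region only.
     * Note it does NOT complain if the warning ceases to apply --
     * which is the half of the proposed #acknowledge that's missing. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wconversion"
    uint16_t result = wide;
#pragma GCC diagnostic pop
    return result;
}
```

On compilers that don't understand these pragmas they are (per the C standard) ignored, so the code remains portable; only the suppression is GCC/Clang-specific.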
Reply to
Don Y

Virtually all incentives can be gamed -- to the *detriment* of the person doing the incentivizing!

As a contractor, I've come up with a practical solution: bug fixes are free. So, it behooves (OMG!) me to get it right *before* release. In that way, the client sees the total cost "up front" ( "up middle"?) and not nickels and dimes for months/years thereafter. Coupled with fixed cost quotes, I considered it an attractive solution to manage the risk (also forces clients to know what they want and outline that in "terms" that can be *measured* in the deliverables -- and, keeps them out of my hair: "Sorry, any changes have to come in a new contract. Do you want to *cancel* this one, NOW, and 'square up' with what I'm owed -- as I may not want to take on your *new* contract!??" :> )

Reply to
Don Y


Thanks! I pulled down each of them (no PDF for the last?) and will dig through them in the next few days.

Agreed. But, you can't automate the measurement of specs. :-/

Agreed. But, I only think this works for waterfall-style processes. The others currently /en vogue/ seem to be glorified ways of *ignoring* the "homework" phase.

I'd like to set up an environment where much of this can be measured automatically and track "performance", over time (drawing any potential "conclusions" at the finish). Unfortunately, I can't see how to instrument the "time" aspect of the effort. It's misleading to note the time between checkout and subsequent check-in as representative (even loosely!) of the time spent working on/staring at a particular piece of code. Even if you could measure the time during which the code was "active" in an editor/IDE, there is no guarantee that the developer is *looking* at it! Or, *thinking* about it!

Reply to
Don Y

If you want to see this debate played out for a wider audience, take a look at all the methods which have been suggested -- or used -- to evaluate "teaching" or "education": class sizes, favorable student reviews, amount of money per student, multiple-choice tests, GPA, parental feedback, ... all of which seems to indicate that there is no generally agreed-upon measure of either "efficiency" or "effectiveness". Which, since we're talking about human beings here, doesn't stop a lot of heat -- and the odd bit of illumination -- being generated on the topic.

"We know good/bad coding when we see it" seems as good a metric as any, as does "We know good/bad teaching when we see it". It just takes a lot of time and effort, and depends on honesty and trust... which are not "mechanical" processes.

Good luck!

Frank McKenney

  Quackery is false science; it is everywhere apparent in cheap and 
  popular science; and the chief mark of it is that men who begin 
Reply to
Frank McKenney

...which is precisely my complaint with these metrics.

When using a complicated or pitfall-ridden language such as C or C++, you cannot assume people know their craft perfectly, so a tool sounds like a perfect excuse. What I believe is overlooked is that even if you have a tool, you still need a competent guide to tell people what to do with the tool's messages - and a livable process for people to confirm "yes, I know more than the tool at this point".

Without that, intermediate developers will believe the tool, and the advanced ones who don't believe the tool will game the system because of the complicated deviation process (doing protocol means two more hours overtime, gaming the system means five minutes). Neither improves software quality.

"no warnings" is a pretty vague goal, because every compiler warns differently, and some warnings are unavoidable. For example, u++; will warn with gcc and -Wconversion if u has a type shorter than int, and the only way to shut that up is to convert it into something verbose like u = (uint16_t) (u + 1); That's why I have mixed feelings about such a rule (but normally target "no warnings" as well).


Reply to
Stefan Reuther
