"normalizing" data

Hi,

The Back Story: I have a "units calculator" that lets the user do things like: (1 mi + 5300 ft - 7 yds + 12 in) / 20 min = 6 MPH [assuming I have done the arithmetic correctly tonight]

A consequence of this ability is the user can specify things like "4 ft 3 in" (instead of "4.25 ft").

My parser doesn't enforce particular rules on how the user specifies such values (though, obviously, the units must be similar -- no adding inches to candelas!). So, something like: "3 in 7 yd 5 ft" (!) is legitimate.
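[A minimal sketch of that kind of free-order parsing -- hypothetical names, Python, assuming whitespace-separated value/unit pairs; not the actual parser:

from fractions import Fraction

# Conversion factors to a base unit (inches); illustrative subset only.
TO_INCHES = {"in": Fraction(1), "ft": Fraction(12),
             "yd": Fraction(36), "mi": Fraction(63360)}

def parse_length(text):
    """Sum value/unit pairs in any order, e.g. "3 in 7 yd 5 ft" -> 315 in."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating value/unit pairs")
    total = Fraction(0)
    for value, unit in zip(tokens[0::2], tokens[1::2]):
        if unit not in TO_INCHES:
            raise ValueError("unknown unit: " + unit)
        total += Fraction(value) * TO_INCHES[unit]
    return total

print(parse_length("3 in 7 yd 5 ft"))   # 315 (inches) = 26 ft 3 in = 8 yd 2 ft 3 in
]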

Issue: There are many places in the application where the user *stores* such data (i.e., where it must be "recall-able"). I currently do NOT "normalize" the data when re-presenting it to the user. I.e., the above example is regurgitated as "3 in 7 yd 5 ft" and *not* as "8 yd 2 ft 3 in" (nor as "26 ft 3 in", "26.25 ft", etc.).

This recognizes the fact that the user had a reason for specifying it in a particular manner and silently converting it to some "normalized" form would only confuse him/her. ("Hmmmm... I thought I specified '3 in 7 yd 5 ft'... is that the same as '26.25 ft'?")

[the user can always have the data converted to whatever form he wants -- including something like "mm fur in"]

However, I *do* process his/her input to remove "superfluous" characters -- leading/trailing zeroes, extra whitespace, etc.

This saves an insignificant amount of time in subsequent processing of the data. And, an even more insignificant amount of *space*. As such, it seems like my reasons for doing so are not really justifiable -- why not have the user ask for "pretty printed" data just like he would/could ask for it to have been converted to other arbitrary units??

Furthermore, there can be significance to some of that stuff that I am stripping off -- "2.000" is different from "2".

Comments?

Thx,

--don

Reply to
D Yuniskis

There's that whole significant digits thing... Without further qualification, a measured value recorded as "2," implies that the actual value is in the range 1.5..2.5. "2.000" implies that the actual value is between 1.9995 and 2.0005. The issue being the (always finite) precision of your measuring device.
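[To pin down the convention being described, a rough sketch -- Python, names are mine -- that derives the implied half-interval from the number of decimal places in the lexical form:

from decimal import Decimal

def implied_half_interval(lexical):
    """Half a unit in the last written digit of a plain decimal string."""
    exponent = Decimal(lexical).as_tuple().exponent   # "2.000" -> -3, "2" -> 0
    return Decimal(5) * Decimal(10) ** (exponent - 1)

print(implied_half_interval("2"))       # 0.5    -> 1.5 .. 2.5
print(implied_half_interval("2.000"))   # 0.0005 -> 1.9995 .. 2.0005
]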

Any student in first-semester physics who added "1.23 meters" and "3 meters", and answered "4.23 meters" on an exam, would be marked wrong.

Reply to
robertwessel2

Exactly.

Yes. Fyziks questions are always best answered with one of the following:

- "a bright orange flame" (often qualified with "and a loud noise");

- "for sufficiently small values of (insert favorite quantity here)"; or

- any equation with a 'c' in it.

OTOH, I don't see any "significance" to the *actual* whitespace used -- nor to *leading* zeroes.

OTOOH, I can't read the user's mind -- and I don't see any *huge* downside to leaving all this cruft in place (i.e., if the user thought it worthwhile to *enter* it that way... )

Reply to
D Yuniskis

There's one downside: the user might want to be sure that the data had been properly interpreted and stored. We all make typos. Don't you prefer to be informed of them as soon as possible?

-- Joe

Reply to
J.A. Legris

Yes, that is the reason for not "reducing/normalizing" the data (as I said in my original post). If the user wanted to specify "2 ft 39.27 in" (!) then, presumably, there is a *reason* he opted for that form instead of "5 ft 3.27 in".

*But*, is there any reason why I should *literally* preserve " 00000002 \t \t feet 0039.270000000000 in "? Would it "confuse" the user that much if he later saw it as "2 ft 39.27 in"? I guess I just can't see where the extra fluff becomes significant (unless it had to do with formatting in the application that feeds the data to me).
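[For what it's worth, a "light touch" cleanup along those lines might collapse whitespace and drop leading zeroes while leaving trailing zeroes alone, since those can carry significance. A hypothetical sketch in Python:

import re

def tidy(entry):
    """Collapse whitespace runs and strip leading zeroes; keep trailing zeroes."""
    s = " ".join(entry.split())          # tabs / runs of spaces -> single spaces
    s = re.sub(r"\b0+(\d)", r"\1", s)    # "00000002" -> "2", "0039.27" -> "39.27"
    return s

print(tidy("  00000002 \t \t feet   0039.270000000000 in "))
# -> "2 feet 39.270000000000 in"
]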
Reply to
D Yuniskis

I don't see any reason to meddle with what the user entered. Either way you need to store the value you will use, and the value they entered, so I'd just leave the latter alone.
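[In code terms, "store both" amounts to something like the following sketch -- Python, and the field names are mine, not Clifford's:

from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class StoredQuantity:
    as_entered: str       # exactly what the user typed, echoed back verbatim
    canonical: Fraction   # internal value in a fixed base unit (e.g. inches)

q = StoredQuantity(as_entered="3 in 7 yd 5 ft", canonical=Fraction(315))
]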

Clifford Heath.

Reply to
Clifford Heath

My take on it is that once you make the decision to preserve what the user entered, to whatever extent, then you should save it all -- with the proviso that it passes the parsing legitimacy test.

--
Michael Karas
Carousel Design Solutions
Reply to
Michael Karas


You've interpreted my post exactly the opposite of what I meant. Read it again.

If your programming is as sloppy as your reading, you've just suggested yet another reason to normalize - there are probably a few bugs in your code that will misinterpret some strings. The user needs feedback that indicates exactly how the machine has interpreted the input.

-- Joe

Reply to
J.A. Legris

Sorry, I've read it twice more and *still* stand by my reply.

I don't see how preserving the leading/trailing zeroes and various flavors of whitespace helps him verify that the code "interpreted" it correctly. (for the alternative interpretation, see below)

If the user wants the data "interpreted" (i.e., assuming you want, interpreted to *mean* "normalized"), he can always *request* that!

(perhaps YOUR reading skills failed to note the caveat in my original post: '[the user can always have the data converted to whatever form he wants -- including something like "mm fur in"]' e.g., requesting it as "ft in" would yield "5 ft 3.27 in")

Reply to
D Yuniskis


I keep a "normalized" value but that is not in a form to which the user would easily relate.

I guess the problem I see is the user's acceptance of this. It is contrary to what he/she/we typically encounter in our interaction with devices. E.g., type leading zeroes on a calculator and they are absorbed *as* typed. Trailing zeroes are preserved *until* the calculator can determine that they need not be (i.e., no non-zero digits follow). The actual content of "whitespace" is indistinguishable from "lots of spaces" in almost all applications. etc.

We tend to interact with devices that *do* "normalize" (in this sense) our "inputs" -- and do so, "interactively" (i.e., they absorb the zeroes while we are typing them). Most devices immediately convert our input to some sort of normalized form as soon as we have ENTER-ed it. Or, inherently impose that normalization on us (e.g., by only giving us normalized *choices*).

So, it's a question of "least surprise". Would the user be more surprised to see his result come back "dressed up" (and then incur the cognitive load of having to decide if that is really what he typed?) or would he be more surprised to see all of his blemishes echoed back at him?

(e.g., in the latter case, how then do I allow him to identify tabs within whitespace, etc.)

From the code's point of view, the storage and (repeat!) processing consequences aren't a real problem. (OTOH, I have seen some applications suffer badly from making the choice to repeatedly reparse data -- I can cheat there by parsing and converting to some "internal" normalized representation alongside the "as entered" representation).

But, if the application's front-end does the sort of interactive preprocessing we see in many appliances, then NOT normalizing tends to look like a schizophrenic interface (i.e., it's doing SOME of these things but not ALL)

I need to investigate all of the front-ends to see if they can/will be consistent in this regard -- before I come to a conclusion...

Reply to
D Yuniskis

The beauty/appeal of this is it is a simple rule that the user can easily remember. No "exceptions" to deal with.

"Huh? Why are all those leading zeroes in there? Oh, because I must have TYPED them in there! D'uh..."

OTOH, it means I need to make sure the user can readily recognize every "thing" he has typed. E.g., if I allow whitespace other than "spaces", then he must be able to differentiate those stored/recalled "non-spaces" from the REAL "spaces".

OK, maybe just outlaw everything other than spaces!

Reply to
D Yuniskis

The normalization question seems somewhat at odds with an earlier post about significant digits.

IIRC, someone pointed out that saying 1.2 * 3.4567 = 4.1148 (and not 4.2) violates your physics instructor's admonition not to assume greater precision in the result than is evident in the operands.

Here's the problem: if the user enters 1.2, do you assume that it was really user-truncated 1.20000000, or was it 1.2 +/- 0.05?

OTOH, as a programmer, do you want to make a decision for the user and accept 1.2 x 3.4567 to be 4.2?

Calculators don't seem to worry about significant digits and generally give you as many digits as their screen will display or as many after the decimal point as you request. Should your normalization do the same?
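[As a concrete illustration of the rule in question -- a hedged sketch in Python; the exact product, as corrected later in the thread, is 4.14804:

from decimal import Decimal

def round_sig(x, sig):
    """Round to the given number of significant figures."""
    return Decimal("{:.{}g}".format(x, sig))

product = Decimal("1.2") * Decimal("3.4567")
print(product)                 # 4.14804
print(round_sig(product, 2))   # 4.1 -- two significant figures, from the "1.2"
]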

Mark Borgerson

Reply to
Mark Borgerson
[snipped clutter]

"Normalization" is a bad choice of words -- but, I'm at a loss as to what a *better* term might be.

There are several issues that I have inconveniently wrapped into one -- and arbitrarily called '"normalizing" data'.

- maintaining precision/significant digits

- "pretty printing" (for want of a better term)

- user expectations/experiences

(note that I have skipped over "efficiency" completely)

The calculator uses variable/arbitrary precision throughout. It's up to the user to decide how much precision he wants in his results (we'll skip the obvious precision/resolution/accuracy argument here, please :> ). In some cases, the application (using the stored data) imposes its own criteria on the *use* of the stored data. In others, the user is free to decide (i.e., if you really want to know what 1 AU + 1 mm is -- expressed in AU, of course -- you are free to ask for that to whatever "precision" you would like).
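[The "1 AU + 1 mm" case, sketched with exact rationals -- Python, helper names are mine; 1 au = 149 597 870 700 m exactly, per the IAU definition:

from fractions import Fraction
from decimal import Decimal, getcontext

AU_IN_MM = 149_597_870_700 * 1000                # millimetres per astronomical unit

total_au = Fraction(1) + Fraction(1, AU_IN_MM)   # 1 AU plus 1 mm, kept exact
getcontext().prec = 25
print(Decimal(total_au.numerator) / Decimal(total_au.denominator))
# -> 1.000000000000006684587...
]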

And some don't even do *that* "accurately"! :>

That's the question I am asking on behalf of the hypothetical *user*! I.e., what is the "least surprising" thing to do with the user's "input"? What are *legitimate*, purposeful actions on the user's behalf that should be preserved -- and which should be ignored?

E.g., almost all data entry methods "cook" the input stream. Should the "raw" keystrokes, instead, be saved? (I'm being silly here to illustrate a point).

I, for example, store all physical constants, etc. to the greatest *defined* precision available to me (much to the chagrin of legislators in Indiana :> ) with the belief that I can always throw away precision (before or after computation) but can rarely *gain* it. And, the form in which I present that data (for storage) always has *some* sort of rational basis beyond "arbitrariness" or "convenience".

I am giving the user the benefit of the doubt and assuming that there may be some method to *his* madness...

Reply to
D Yuniskis

The canonical computer industry word is canonicalization :).

As a user, I care about two things: that you have an accurate representation of what I meant (that includes not truncating digits; just because I entered 1.2 doesn't mean I didn't intend 1.200000) and that when you redisplay what I entered, I can easily see that it hasn't been changed from what I entered.

Redisplay the data as it was entered (i.e. store the unchanged lexical form), and define and use a canonical form.

If there's a big difference (lexical form allows expressions, for example) then display both values. Expression evaluation should not be confounded with input validation.

No, never - unless the canonical representation is limited in precision.

Were they the idiots who defined Pi as 22/7?

Clifford Heath.

Reply to
Clifford Heath

So, if I allow non-space whitespace, I need a way of letting the user verify the presence and location of those non-space whitespace characters in the input (?)

I only allow access (indirectly) to that internal form. E.g., I store data as arbitrary precision, floating point, decimal, rationals (blech... what a mouthful). Exposing those directly to the user would just confuse him. OTOH, it lets me *guarantee* that if you *specified* it, I can *represent* it (to whatever precision you eventually decide upon)
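[That guarantee is easy to see with exact rationals -- an illustrative sketch, not the actual internal format:

from fractions import Fraction

entered = "39.270000000000"
exact = Fraction(entered)                  # 3927/100, exactly what was specified
print(exact)                               # 3927/100
print(Fraction(float(entered)) == exact)   # False: the binary double is a nearby
                                           # but different rational
]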

Or, if the user expressly requests a lesser precision (directly or indirectly) for the "result". E.g., 2.5 t is 1 T (to one significant digit -- ~16% error)

Actually, I don't think they were even *that* accurate (IIRC, they were content with "3"). FWIW, I think saner heads prevailed and prevented this idiocy.

OTOH, I think there is a "half hour" timezone someplace in Indiana (?)

OTOOH, AZ doesn't observe DST (OTOOOH, maybe that's a sign of sanity? Definitely nice not having to remember to diddle the clocks twice a year -- though DOUBLY annoying to have to remember when everyone ELSE has!!)

Reply to
D Yuniskis

Non-space whitespace is formatting. If you allow it, you have to decide whether you care about (a) formatting, (b) expression evaluation or (c) flexible entry of individual values. Trying to do all three is a mistake.

I'd reject tab, newline and carriage return characters in the kind of input field you've described. Tab is expected to jump to the next field, and Enter (used for ^J, ^M) normally commits the form. Fulfil those expectations unless you have a *very* good reason not to.

I think what I'd do is to display a single numeric field, and provide a little calculator icon that pops up a larger entry field, where you can enter an expression, and see the calculated value in the single field when you commit the result. Display using the precision implied by the most precise value in the input. Optionally preserve the original expression as entered.

There are a lot of places like that. Adelaide, in South Australia, is one.

There's less point in DST when you're near the equator and twilight is short. Queensland is in the same timezone as Sydney and Melbourne, but doesn't do DST.

Clifford Heath.

Reply to
Clifford Heath


Well, that admonition is completely nonsensical if ripped out of the educational context it came from. Oh, and just BTW, let's agree the correct 2-digit result is 4.1, not 4.2, shall we? ;-)

Spelling out exactly all those digits that are not dominated by error is indeed one (rather coarse) way of expressing information about a value's precision. But it's by no means the only one, so there's really no viable reasoning to always assume '1.2' would somehow have to mean "1.2 +/- 0.1".

Some people in the field of high-precision measurement, e.g., have devised their own notation: 1.234(56) means 1.234 +/- 0.056 (the number in parentheses is the error, and it's understood to be right-adjusted with the value itself, i.e. it ends in the same decimal digit as the value).
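[A hypothetical parser for that concise notation (Python), reading the parenthesised digits as right-aligned with the value:

import re
from decimal import Decimal

def parse_concise(s):
    """'1.234(56)' -> (Decimal('1.234'), Decimal('0.056'))."""
    m = re.fullmatch(r"(\d+\.(\d+))\((\d+)\)", s)
    if not m:
        raise ValueError("not in concise notation: " + s)
    value = Decimal(m.group(1))
    places = len(m.group(2))                        # decimal places in the value
    error = Decimal(m.group(3)) * Decimal(10) ** -places
    return value, error

print(parse_concise("1.234(56)"))   # (Decimal('1.234'), Decimal('0.056'))
]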

It doesn't matter --- both assumptions are wrong.

Reply to
Hans-Bernhard Bröker

Good catch. There were two problems with my result: first, the actual answer is 4.14804; I inserted an extra 1 digit. Second, while fooling with the numbers, I first rounded to 4.148, then 4.15, then 4.2! That goes to show that repeated rounding can get you into trouble.

Interesting. I haven't run across that before.

Worrying about significant digits outside the classroom is probably not worth the trouble and was probably more emphasized in my early days in college physics when calculations were done with a slide rule.

Reply to
Mark Borgerson

You're thinking more along the lines of a GUI.

I'm trying to cheat and omit the need for a plumber -- yet still allow "values" to be moved between applications freely. I was hoping to fold (some portion of) the plumber's role into the input syntax -- easier than coupling the plumber to each such application.

I suspect I will either have to design such a plumber *or* require the user to more finely "select" the "value" to be cut/pasted. (doing so means I can then tightly control the input syntax)

Not sure I understand your reasoning, there.

Regardless, other states at the same latitude *do* observe DST so it's just inconsistent (gee, is that surprising??)

Reply to
D Yuniskis
