Attention: European C/C++/C#/Java Programmers - Call for Input

NO legislation dictates what the PROGRAM calls a record or even a field of it. The legislation may well dictate the MEANING of the data and, rarely, the format (Social Security number, Zip/Post Code). The legislation will dictate how the data is used, the type of data, and how the data is presented (screen or print).

As far as legislation is concerned, records called A, B, C have as much significance as 305787, 298770, 16398698...

Legislation may dictate the meaning and contents, such as how many official languages are required, the format, etc.


Which is to do with processing of the data, not how the program is coded. Even legislation on how the maths is done does not dictate how the code and its variables etc. are typed in.

--
Paul Carpenter | snipped-for-privacy@pcserviceselectronics.co.uk
PC Services
Timing Diagram Font
GNU H8 - compiler & Renesas H8/H8S/H8 Tiny
For those web sites you hate

Reply to
Paul Carpenter

Why not? Add a comment in your national language, describing the mapping.

However, I must admit that terms defined by national legislation are rare in my day work. This is comp.arch.embedded, not comp.software.income-tax-forms. If I have to implement "law X says my device may not do Y", I put that in a comment. "law X" is not a first-class object in my code.

A former Commodore BASIC programmer will tell you that 2-character identifiers are enough, too :-)

Stefan

Reply to
Stefan Reuther

No. Not like that. I see 64 control characters and lots of other non-alphanumerics in there.

Reply to
Paul K. McKneely

It's no problem for Thunderbird, but it seems to be a challenge for Paul McKneely's Outhouse Express. No doubt along with their proprietary character set compiler, and their own editor, they will be writing their own newsreader and email client to work with their character set. No one would ever use such incompatible tools, of course, but it will keep their programmers busy.

Actually, mathematicians use funny fonts because they write with a pencil, avoiding any encoding issues. When writing on a computer, they *do* have multi-character identifiers - they write \pi and \alpha.

And even if you really want 2π, you pre-calculate it and store it as a constant called "twoPi" :-)

Reply to
David Brown

Please don't top-post - you'll wake the sleeping net-nannies.

Boudewijn did not imply that Europeans are "rude jerks" - *you* inferred it from his first reply.

You came to this newsgroup asking opinions about inventing a new character encoding to allow European programmers to use native language identifiers in a new programming language. Now you've heard those opinions - it's an amazingly stupid idea. Whoever thought of it clearly has no clue about European languages or their alphabets, no clue as to how Europeans write their code, and has apparently never thought to actually *ask* Europeans if they would want such a "feature".

Reply to
David Brown

I said right up front that I am an American who wants to have some feedback from Europeans because I don't have a full appreciation for what you just said above. That is why I posted in the first place. Now why do so many of you get so angry when I ask for your opinions, and then turn around and claim that I am so thoughtless as to never ask for your opinions? Duhhhh!

Illustration:

Boy: Daddy, can I please have a drink of water?

Man: Why don't you ever ask before you get a drink of water, you stupid boy!

So what did Europeans think about this "incredibly stupid idea"? Or did the Java development team forget to ask any Europeans if they even wanted it?

But seriously. I gave up trying to please others. I just try to please myself. I am putting those little extras in there for when I want to use them. In my way of thinking, those "funny looking European characters" are for extending English for those who speak English. The language has adopted a very large number of foreign words and I am regularly corrected by those who primarily speak English on how to properly pronounce them. If I should pronounce them properly when used in an English sentence, then why shouldn't I spell them properly too?

But let me ask you a question if you are willing to honestly answer it. Are you angry at me because I am an American? Do you feel that Americans cram their ideas down your throat and you have little to say about it? I really would like to know. Because if the answer is "Yes", I am inclined to sympathize with you because they do that to me too.

Reply to
Paul K. McKneely

Paul K. McKneely wrote: ...

Probably they asked "WEB-designers".

I don't see anyone angry here. And there are so many nations in Europe, that we can easily pick on each other.

We have established a European government to do that job ;-)

Falk P.S.: ASCII is OK

Reply to
Falk Willberg

In article , Paul K. McKneely wrote: [...]

As you can in Ada?

About the character set usable in Ada, see the following section in the Ada 2005 Reference Manual

and the corresponding section in the Ada 2005 Rationale.

Hope this helps,

Dirk snipped-for-privacy@cs.kuleuven.be (for Ada-Belgium/-Europe/SIGAda/WG9 mail)

Reply to
Dirk Craeynest

I know you are American, but you *do* understand English, don't you? No one is angry or annoyed - I don't think anyone but you has posted an angry or directly rude post. Perhaps you are unaware of how British English speakers (and many other Europeans when speaking or writing in English) use things like sarcasm and understatement for emphasis.

I too can give illustrations:

Boy: Daddy, I'm thirsty, so I'm going to have a sandwich. Do you think the sandwich would be best with potato or with cabbage?

Man: If you're thirsty, have a drink. Try water or milk.

Boy: Why are you getting so angry at me?

Did you miss the key point? *UNICODE*. They very specifically chose a *standard* for their encodings, not something incompatible and proprietary. In particular, it's very useful to be able to write comments and strings in Unicode - many modern languages allow it. If you had suggested using Unicode, or Latin-1, or listened to the idea when it was suggested, then you'd have got far more support - it's the idea of having a proprietary half-baked encoding that is incompatible with every other tool that is "incredibly stupid".

Allowing non-ASCII identifiers is a waste of time for European programmers. It may be of interest to those with more significantly different languages and writing, such as Arabic speakers or Far Eastern programmers, but I seriously doubt it. If your tools are expected to work with other compiler tools (such as using existing linkers or archivers, linking with output of other compilers, debugging, etc.), then allowing non-ASCII identifiers will lead to chaos. Sun can get away with it for Java because they don't need such interaction, so once they allowed Unicode for strings and comments, it cost them virtually nothing to allow it for identifiers. Being Unicode, they don't need to worry about other tools such as editors.

So now you are developing an entirely new programming language for your own benefit, and you are inventing a new character encoding just so that you can use variable names like "naïve"?

I am *not* angry with you. I am somewhat frustrated that you have started out with a pre-conceived idea, asked opinions on your implementation of the idea, and can't seem to grasp that it was a terrible idea in the first place.

The idea of your having decided in advance what you think is best for other people without having asked them, particularly in reference to people from other countries, is certainly stereotypical American. But I try not to give much credit to stereotyping unless it is thrust upon me. I certainly won't blame you for being American!

Reply to
David Brown

Most can. Even Outlook Express, though some configuration is required, IIRC.

--
Made with Opera's revolutionary e-mail program:  
http://www.opera.com/mail/
Reply to
Boudewijn Dijkstra

Yes, Dirk. That helps. Thanks.

Reply to
Paul K. McKneely

My fault for phrasing my original question badly. I should never have mentioned the words "character set". Forget that there is an internal encoding method that is used in the compiler tools for this new language whose codes will never be seen by its users. The programming language supports only a subset of the complete UNICODE character set regarding the Western European alphabetics. The language only recognizes a maximum of 254 alphanumerics (Basic Greek and Cyrillic are included) for variable names etc., including the underscore, which is regarded as alphabetic but ordinally precedes all others. If Western European programmers had to choose a subset of these for language support, which ones would they be?

But I gather now that European programmers, for the most part, don't care because these localized characters wouldn't be used in their programming anyway because of the inter-operability problems that arise when they are applied to source code. Since the programmers I speak of are not interested in them, but space has been allocated for many of them, I can take the huge tome of UNICODE characters and make the choices myself, a naïve American :) But I will also consider other subsets (some of which have been suggested by helpful posters) in the process of making my final decision.

Thank you (really) for your input.

Paul

Reply to
Paul K. McKneely

I still do not understand why you want to use your own internal representation instead of e.g. UTF-8. For any language using a Latin script for identifiers, the effective string length is 1.0x, or in rare cases 1.1x, the length of the identifier. For Cyrillic or Greek, the ratio is 2.0.

So the extra memory consumption, e.g. in compiler symbol tables, is negligible.
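Those ratios (and hence the negligible cost) fall straight out of the UTF-8 length rules. A minimal C sketch of where the 1.0-1.1x and 2.0x figures come from; the helper name is made up:

#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8.  ASCII letters
   take 1 byte; Latin letters with diacritics and all Greek and
   Cyrillic letters take 2, which gives the identifier-length
   ratios mentioned above. */
static int utf8_length(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII                           */
    if (cp < 0x800)   return 2;   /* Latin supplements, Greek,
                                     Cyrillic, ...                   */
    if (cp < 0x10000) return 3;   /* rest of the Basic Multilingual
                                     Plane                           */
    return 4;                     /* supplementary planes            */
}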

Regarding linkers, UTF-8 global symbol names should not be a problem, unless the object language uses the 8th bit for some kind of signaling (such as end of string) or otherwise limits the valid bit combinations.

Of course the UTF-8 encoding may increase the identifier length, but at least for a linker that usually examines only a specific number of bytes, such as 32, the only risk is that two identifiers are not unique within 32 bytes, i.e. 16 characters in Greek or Cyrillic, or 10 graphs in some East Asian script.

Paul

Reply to
Paul Keinanen

Simply encoding a kazillion different characters is not the whole picture. As Boudewijn Dijkstra pointed out, trying to alphabetize all of the potential UNICODE variables is impossible. (Those are his words, not mine, and the ramifications go far beyond just this issue.) So how do you alphabetize, search and list an unwieldy character set for many purposes, such as showing a listing to the programmer in his tool chain? That is not to mention that 21 bits (or 32 bits) are already used up in just the character's code. The new programming language supports fonts, color (foreground and background), attributes, size etc. Do you think it is a good idea to have to expand these basic character codes to 64, 96, 128 or even 256 bits in width just to cram it all in? The web people would want to encode it all in ASCII HTML-style tags, which I think is a really bad idea.

The overwhelming consensus among responders to these threads is that they are not going to use anything beyond ASCII anyway. And with all of this text stuff, you haven't even begun to talk about how you are going to achieve all of the very advanced (and very difficult) stuff in the programming language (much of which hasn't ever been done before) while carrying this huge load of excess baggage on your back.

I needed to define some additional characters that weren't in ASCII (and aren't in UNICODE) for the purposes of the programming language (which predates UNICODE and UTF-8, BTW). Citing APL's additional characters as the downfall of that language is not well founded, in light of the fact that, when it came out, you had to put out a couple of thousand dollars for a hard-wired specialized terminal just to program in it. That is besides the fact that it was not designed for the kinds of things that I want to do with it (such as writing operating systems and device drivers). Do you see my point(s)?

Simple, lean and mean, but more powerful than anything we have now. That is what I am shooting for. When symbols need to be converted to whatever format when object files are produced, that's where the necessary conversions will be done. This will keep the core of the tools much simpler (and smaller and run faster) so that the whole project won't collapse when I try to do the really difficult things that were the primary goals that I started out to accomplish in the first place.

I do want you to know that I do very much appreciate your input. This issue about object formats supporting UNICODE is going to be a real help when it comes time to generate machine code.

Reply to
Paul K. McKneely

If you want more colors, font sizes etc., one idea might be to use something like TeX, e.g. as is possible with literate programming. An example of how it looks:

formatting link

Simpler to type might be a more formal language, e.g. Fortress:

formatting link

This sounds interesting, can you say more about your ideas? Maybe it would be nice for some programmers if you can use all Unicode characters for identifiers or comments, but the more important part is the architecture of the language.

--
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Reply to
Frank Buss

Showing a listing: you just need a font that has all the characters. When I need to look up something in Unicode, I start by opening charmap.exe and selecting Lucida Sans Unicode.

Sorting? 100% correct sorting and case-folding is locale-dependent anyway. So you sort the locale characters in a locale-dependent way, and the others by their Unicode number. This usually gives a sensible result. Alternatively, invent some kind of sort (such as "sort all accented characters after their base characters"). And if you're only sorting them for your internal symbol management, which users don't ever get to see, use Unicode numbers.
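One convenient property if you do sort internal symbols by Unicode number: UTF-8 was designed so that plain bytewise comparison of two UTF-8 strings gives exactly code-point order. A minimal C sketch, assuming NUL-terminated UTF-8 identifiers; the comparator name is invented:

#include <stdlib.h>
#include <string.h>

/* qsort comparator for NUL-terminated UTF-8 identifier strings.
   strcmp compares as unsigned bytes, and UTF-8 preserves code-point
   order under bytewise comparison, so this already sorts "by
   Unicode number" with no decoding step. */
static int cmp_by_codepoint(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    return strcmp(sa, sb);
}

/* Usage: qsort(names, count, sizeof(char *), cmp_by_codepoint); */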

Case-folding? Case-folding tables are very well compressible. I use a table with about a dozen entries of the form struct { uint16_t FirstLowercaseCharacter; uint16_t FirstUppercaseCharacter; uint16_t NumberOfCharacters; uint16_t DistanceOfCharacters; } to case-fold a repertoire of, I think, over a thousand characters. Entries are things like { 0x61, 0x41, 26, 1 } for ASCII, something like { 0x100, 0x101, 33, 2 } for the first half of Latin Extended A. In total, much less data than a case-folding table for DOS codepage 437.
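To make the compressed table concrete, here is a minimal C sketch of one way to apply such entries; the function name, the lookup loop, and the field order used in the Latin Extended-A entry are my own assumptions, not Stefan's code:

#include <stdint.h>

struct FoldRange {
    uint16_t FirstLowercaseCharacter;
    uint16_t FirstUppercaseCharacter;
    uint16_t NumberOfCharacters;
    uint16_t DistanceOfCharacters;
};

/* Two entries of the kind described above: ASCII a..z, and the
   first half of Latin Extended A, where upper and lower case
   alternate (0x100 'A with macron', 0x101 'a with macron', ...). */
static const struct FoldRange fold_table[] = {
    { 0x0061, 0x0041, 26, 1 },
    { 0x0101, 0x0100, 33, 2 },
};

/* Fold a code point to uppercase; return it unchanged if no entry
   applies. */
static uint32_t to_upper(uint32_t cp)
{
    for (unsigned i = 0; i < sizeof fold_table / sizeof fold_table[0]; i++) {
        const struct FoldRange *r = &fold_table[i];
        if (cp >= r->FirstLowercaseCharacter) {
            uint32_t delta = cp - r->FirstLowercaseCharacter;
            if (delta % r->DistanceOfCharacters == 0 &&
                delta / r->DistanceOfCharacters < r->NumberOfCharacters)
                return r->FirstUppercaseCharacter + delta;
        }
    }
    return cp;
}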

What'll make it really complicated is composing/decomposing characters from their accents and the base character...

Depends on what you want to achieve, and at what point you'd manipulate these attributes. Using control characters ("escape sequences") would be one sensible approach. Using extents (an additional data item added to the string that says "characters 20 to 30 are red") is another. Both also give the possibility to add parameters to your attributes, such as font names or link targets. I use both approaches regularly.
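A minimal C sketch of the extent idea, with invented names; the point is only that the attribute sits beside the text instead of inside every character:

#include <stddef.h>
#include <stdint.h>

/* "Characters 20 to 30 are red": the styling is stored as an
   extent next to the text rather than packed into each cell. */
struct Extent {
    size_t   first;       /* index of first character covered    */
    size_t   last;        /* index of last character covered     */
    uint32_t attribute;   /* e.g. colour, font id, link target   */
};

struct RichText {
    const char    *text;        /* plain UTF-8 characters          */
    struct Extent *extents;     /* attribute runs, sorted by first */
    size_t         num_extents;
};

Escape sequences carry the same information in-band; extents leave the plain text untouched, so searching or comparing it needs no stripping pass first.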

Sure? The Unicode book is thick :-)

Stefan

Reply to
Stefan Reuther

On Thu, 29 Jan 2009 18:21:04 +0100, Paul K. McKneely wrote:

If you are going to encode all this formatting information on a per-character basis, you are going to have a lot of redundant information, which would make compression a given. Then why not go all the way and encode a 32-bit Unicode character, 24-bit foreground and background, etc. in 128 or 256 bits?
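For what it's worth, the "go all the way" layout described above does fit in 128 bits. A rough C illustration with invented field names; bit-field packing is implementation-defined, but this typically comes out at 16 bytes per cell, and long runs of identical styling compress very well:

#include <stdint.h>

/* One 128-bit character cell: 32-bit code point, 24-bit foreground
   and background colours, plus font, size and attribute flags. */
struct CharCell {
    uint32_t codepoint;        /* full Unicode range               */
    unsigned fg    : 24;       /* RGB foreground colour            */
    unsigned attrs :  8;       /* bold, italic, underline, ...     */
    unsigned bg    : 24;       /* RGB background colour            */
    unsigned font  :  8;       /* index into a font table          */
    uint16_t size;             /* character size                   */
    uint16_t reserved;         /* spare bits up to 128 in total    */
};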

Why? Most office suites have a decent HTML export functionality. Various HTML-editors are available. HTML and XML are not only popular in web-type applications.

I am curious to know which language that was, and which characters they are.

--
Made with Opera's revolutionary e-mail program:  
http://www.opera.com/mail/
Reply to
Boudewijn Dijkstra

There's probably more to be gained in the long term by sticking with a current standard of encoding. I say this because the real internationalisation issues are not in the character set, but in translation and display. Western Europe is the least of your problems, without even considering right-to-left display.

When you internationalise an application, even an embedded one, a standard process is to send your text to be translated from English (7 bit ASCII plus a few specials) to your dealer, who translates the messages into his/her language and sends it back to you.

Because you're writing a program for humans to use, you include things like dates, times, and currency. These all vary in format across the world. In addition, parameter order will vary in different spoken languages.
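The parameter-order point is exactly what positional conversion specifiers are for. A minimal C illustration using the POSIX %n$ notation; the message strings are invented:

#include <stdio.h>

int main(void)
{
    /* The translated format string alone reorders the parameters;
       the call site stays the same (POSIX %n$ positional form). */
    const char *fmt_en = "Delivered %1$d parcels to %2$s.\n";
    const char *fmt_xx = "To %2$s we delivered %1$d parcels.\n";

    printf(fmt_en, 3, "Vienna");
    printf(fmt_xx, 3, "Vienna");
    return 0;
}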

On a PC consider what happens when a program written in English by South Africans (three languages in daily use in the office), is run in Hong Kong on a PC with a Chinese operating system but for use by a Russian engineer who wants his package to display Cyrillic (several encodings available). This scenario has been seen in the wild. One customer of ours supports 17 different spoken languages in multiple encodings.

For most embedded systems it's not as extreme as that, especially as there's usually no operating system. Despite this, having to support several languages, even within the same country, is normal. We still have to support varying display orders in embedded systems.

Stephen

--
Stephen Pelc, stephenXXX@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads
Reply to
Stephen Pelc

I would suggest you start by giving up on all your thoughts of specific character sets. Simply make a straight decision now - you will use UTF-8. No other encodings - no Latin-1, no UTF-16, no home-made character sets, no extra fonts. Take it as a fixed decision and work with it for a few days to see how it fits your needs. Look at existing tools and source code that support UTF-8, and see how it can make your work easier and give a result that users might actually be able to *use*. If you really put in this effort and find that UTF-8 does not fit your needs, what have you lost? A couple of days' work here is a drop in the ocean compared to the man-years it will take to work with your home-made encoding, and you will at least have the benefit of a better understanding of your problem. You might even be able to explain it to other people in a way that makes sense.

If you need to alphabetize, there should be no shortage of existing library routines for sorting in UTF-8. It's not easy - differences in locales can cause endless troubles, so you might not get a perfect solution. But you'll find something that does a reasonable job and *will* work perfectly for most programmers who stick to ASCII identifiers.

A related problem is if you are making identifiers case-insensitive - it's hard to figure out cases for non-ASCII characters. So stick to case-sensitive identifiers.

I have no clue as to what you are talking about here.

Are you suggesting that you are including font, colour, etc., directly in the source code? And here was me thinking that a proprietary character encoding was an "amazingly stupid idea".

Who is "you" who are going to achieve all this? Do you mean the developers of the tools (i.e., you and your colleagues), or do you mean your users? And if it is us potential users, what is this "very advanced stuff" you are talking about? If we knew the specific aims of your language - what it is that makes it better than existing alternatives - it would be easier to advise you.

First off, you do *not* need to define additional characters. It's conceivable that your tools might *benefit* from additional characters (although, as I said, we know nothing about your tools). But they don't *need* them.

Secondly, Unicode has openings for additional domain-specific characters - you can add them without losing all the other benefits of Unicode (of course, you'll have to provide a suitable font).

No, I don't see your point at all. It reads as though you are saying APL's lack of popularity was not that it had extra characters, but that it needed an expensive specialised terminal (which was solely because of its special characters).

The main reason for APL's lack of popularity *is* the special characters. Even though you don't need special hardware (you use a specialised keyboard map and extra fonts), the characters make it impossible to read and understand for the non-expert, and extremely slow to enter expressions. It is *vastly* easier to write for example "range(R)" than "⍳R" because you don't have to find the special character. It is also *vastly* easier to read and pronounce, and to understand "range(R)" than "⍳R" even if you have never used the language in question (Python). To take an example from wikipedia's APL page, here is an expression to give a list of prime numbers up to R:

(~R∊R∘.×R)/R←1↓⍳R

The direct Python translation would be:

[p for p in range(2, R+1) if not p in [x*y for x in range(2, R+1) for y in range(2, R+1)]]

The APL version is certainly shorter - but nevertheless is slower and harder to write. APL's power and conciseness comes from the power of its built-in functions, not the fact that most have a single weird symbol instead of a multi-character name.

Reply to
David Brown

Oh really? Where in RSA and what languages? (English, Afrikaans & isiXhosa/isiZulu/Setswana...?) My wife and I were there in September-October for almost 3 weeks. Did a loop from Cape Town to Calvinia, Beaufort West, down to Oudtshoorn, Knysna and back.

Reply to
Paul K. McKneely
