Attention: European C/C++/C#/Java Programmers-Call for Input

Paul K. McKneely 17 years ago

Hi All,

My company is developing a new programming language that continues the original charter of the C language: developing operating systems in a high-level language (HLL), as well as applications, device drivers etc. This language has an extended character set and, although all of the keywords will (still) be in English, identifiers (i.e. names of things) can use additional European characters (such as those with accents, diaeresis, cedilla etc.). For efficiency, a 254-character subset of them will be used to create a character space that encodes each of them in a single byte. These will not only be automatically byte-order (endian) independent but will also be in alphabetic order, so that sorting can take place directly on their numeric values.

What I need from you is input so that I can select the most appropriate set for the benefit of European programmers, who are obviously very talented at what they do. My thought is that it would be great if European programmers could give names to variables etc. in their own native languages, which have more meaning for them than just plain old English words. The character subset includes full upper- and lower-case Greek as well as Cyrillic. I have seen Cyrillic (as well as Greek) with various accent marks (presumably used by eastern European countries), but there is not enough space in a byte to add any of these. However, I have added quite a few to the basic Roman character set that is used so much in English. Since I am an American, I don't have a full appreciation for all of these special marks and symbols, and that is why I am asking for your comments.

I apologize for the low resolution of the glyphs (8 x 16). I do have a TrueType version in the works, but it is incomplete. In the table, columns 0-8 are Roman and its variants, Greek is columns 9-B, and Cyrillic occupies columns C-F. I was surprised how neatly these fell into columns. A reference on the subject of European character sets would be much appreciated. For those of you who are happy to give me feedback, I have attached a table that I have been using that represents the current subset used for identifiers. You may respond directly to me or to the newsgroup for all to see. Many thanks to you.

Regards,

Paul King McKneely technoventure, inc.

Boudewijn Dijkstra 17 years ago

After reading your post, I must conclude that you are oblivious to key concepts and organizations surrounding internationalization and multilingual cooperation. It is a good thing that you sought advice from an intelligible community before re-inventing the wheel (badly).

On Tue, 27 Jan 2009 15:09:33 +0100, Paul K. McKneely wrote:

Like Java does?

formatting link

Why just Europeans? Lots of software is written by Israeli (Hebrew), North-African (Arabic), Chinese (thousands of ideographs in different families) and Japanese (Katakana) people.

Like ISO8859?

formatting link

Impossible. Not every language sorts the same alphabet in the same way. E.g. sometimes accented characters are treated separately, sometimes they are 'equal' to the base character. The process of comparing text for sorting purposes is called collation.

formatting link
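To make the collation point concrete, here is a minimal Python sketch. It uses only the standard library; the word list and the helper name `fold_accents` are made up for illustration. Plain `sorted()` compares code points, which is what sorting "directly on numeric values" amounts to, while the second sort treats accented letters as equal to their base letter.

```python
import unicodedata

def fold_accents(word):
    """Hypothetical helper: drop combining marks so that 'é' compares like 'e'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["zebra", "école", "eclair", "apple", "élan"]

# Order by raw code point: every word starting with 'é' (U+00E9) lands after 'z'.
by_code_point = sorted(words)

# Order with accents treated as equal to the base letter
# (ties between otherwise equal words are broken by code point).
by_base_letter = sorted(words, key=lambda w: (fold_accents(w), w))

print(by_code_point)
print(by_base_letter)
```

Neither order is "the" correct one; which key to use depends on the language and purpose, which is why a fixed numeric order baked into the encoding cannot serve everyone.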

As far as I'm concerned, English is the only language that should be seen in source code elements (except maybe string literals). It is the language of choice for technical terms, the language from which programming languages derive their syntax, and overall the best known language amongst programmers worldwide. English is one of the few languages without accents and with relatively short words, thus allowing relatively efficient typing.

There are two other arguments against your proposal:

- As companies grow, their code flows across language borders. Should they hire translators to facilitate teams in different areas or hire teachers to teach everybody the language of choice?

- Multilingual countries like Belgium and Switzerland need to program in English in order to maintain the 'equality' of their individual languages.

Some of those are essential to be able to write common words in a given language. (I hope that you will learn to appreciate the special marks and symbols used by your Spanish-speaking fellow Americans (amongst others), before you inadvertently insult one.)

I have received no glyphs.

As stated, ISO8859 et al. Note that Microsoft has sanctioned different character sets; Cp1252 is perhaps the most ubiquitous.

formatting link

Many USENET servers don't accept attachments. Please post a weblink.

Made with Opera's revolutionary e-mail program: http://www.opera.com/mail/

Stefan Reuther 17 years ago

Like they already can in Java, C, and C++?

Support for Unicode characters is in the C and C++ standards, but many compilers don't implement it. This may give you a hint how many people want it. Being German, I am satisfied if I can use my fünny chäracters in cömments and strings. But even there, any code that has a remote chance of being shared with anyone else gets English comments. When a Finn came along wanting to help with a program I wrote, I had quite some work to explain my German comments (as an excuse, however, some of them were at that time over 10 years old, written at a time when I was not so fluent in English). But I would definitely not switch programming languages just to use my funny characters.

Why ignore Unicode and invent yet another incompatible encoding? How should people edit their source code? Remember, you'd have to build a whole toolchain supporting your new character set. If my embedded programs make serial outputs in German, they use the Latin transcription, because terminal programs don't even agree upon whether to use Latin-1 or Codepage-437/-850.

Automatic alphabetic sorting is not a useful goal one would want from a character encoding, because it's not possible in general, and doesn't save you any work if you want to do it right for your problem.

- In German telephone books, "ä" sorts as "ae" (the official Latin transcription). In German dictionaries, "ä" sorts as "a". In Finnish, it sorts after "z".

- Almost everywhere, "ß" sorts as "ss". It also doesn't have a widespread capital equivalent (although a Unicode code point has been allocated for it recently).

- In Turkish, the capital letter of "i" is "İ" (U+0130), and the lower-case letter of the thing you know as a capital "I" is "ı" (U+0131).
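The conventions in the list above can be sketched in a few lines of Python. This is an illustration only: the translation tables cover just ä, ö, ü and ß, the Finnish alphabet omits "å", and all function names are hypothetical. Real code would normally rely on a collation library instead of hand-written keys.

```python
# Simplified, hand-written sort keys (assumptions, not complete collation rules).
PHONEBOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})
DICTIONARY = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzäö"  # 'ä' and 'ö' follow 'z'

def german_phonebook_key(word):
    return word.lower().translate(PHONEBOOK)

def german_dictionary_key(word):
    return word.lower().translate(DICTIONARY)

def finnish_key(word):
    # Letters outside the simplified alphabet sort last.
    return [FINNISH_ALPHABET.index(ch) if ch in FINNISH_ALPHABET
            else len(FINNISH_ALPHABET) for ch in word.lower()]

names = ["Zander", "Ärmel", "Adler", "Afrika"]
phonebook_order = sorted(names, key=german_phonebook_key)
dictionary_order = sorted(names, key=german_dictionary_key)
finnish_order = sorted(names, key=finnish_key)

# Case mapping is locale dependent too. Python's str.upper()/str.lower() are
# locale independent, so the Turkish dotted/dotless pairs are mapped by hand.
def turkish_upper(text):
    return text.replace("i", "İ").replace("ı", "I").upper()

def turkish_lower(text):
    return text.replace("I", "ı").replace("İ", "i").lower()

print(phonebook_order, dictionary_order, finnish_order)
print(turkish_upper("istanbul"), turkish_lower("IŞIK"))
```

The same four names come out in three different orders, so no single byte ordering can be right for all three conventions at once.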

Even though it might be possible to fit most Western and Central European languages plus the standard ASCII repertoire into a common 8-bit character set, you'll probably have to ignore Cyrillic and Greek, and still tweak a bit. Latin-1 and Latin-2 taken together have about 280 characters, not counting control characters.

One attempt at such a character set is the EBU character set used in RDS/RBDS, e.g. ftp://ftp.rds.org.uk/pub/acrobat/rbds1998.pdf page 92; I haven't checked how complete it is. However, it was probably designed with the intent to implement it on 8-bit micros :-)

"The Unicode Standard, Version 5.0". Plus Wikipedia.

Stefan

Boudewijn Dijkstra 17 years ago

On Tue, 27 Jan 2009 19:54:46 +0100, Stefan Reuther wrote:

And low-quality graphics, too. E.g. Greek small and capital theta were merged. They have also omitted the Greek and Cyrillic A(lpha) and B(eta) because their appearance is the same as Latin A and B. So strictly speaking it is not a character set but a glyph encoding.
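The difference between a character set and a glyph encoding can be seen with Python's standard `unicodedata` module. The sketch below is only an illustration of the distinction, using three capital letters that most fonts draw identically.

```python
import unicodedata

# Latin, Greek and Cyrillic capitals that usually share one glyph shape.
look_alikes = ["A", "\u0391", "\u0410"]

names = [unicodedata.name(ch) for ch in look_alikes]
for ch, name in zip(look_alikes, names):
    print(f"U+{ord(ch):04X}  {name}")

# A character set keeps the three apart. Merging them into one glyph code
# loses information: each script has its own lower-case partner.
lowered = [ch.lower() for ch in look_alikes]
all_distinct = len(set(look_alikes)) == 3
```

With a merged code for "A", software could no longer tell whether the lower-case form should be Latin "a" or Greek "α".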

"The three code-tables each contain almost all the characters in the international reference version of ISO Publication 646."

ISO 646 is the 7-bit international variant of ASCII and so a distant predecessor of Unicode; early Unicode, in turn, was designed on the assumption that 16 bits would be enough for all conceivable characters.

Frank Buss 17 years ago

Are there any important surrogate planes in Unicode? I don't mean things like this one :-)

formatting link

Frank Buss, fb@frank-buss.de http://www.frank-buss.de, http://www.it4-systems.de

Falk Willberg 17 years ago

I am looking forward to reading source code like this:

principal(de_tout arg_compteur, signe *arg_horaire)
{
    ???????? ??????? //
    kokonaisluku hakemisto;
    terwijl (de kleinere tellers zeven is) { }
"geschwofelte Klammer zu"
...

English or any other "lingua franca" is OK.

SCNR, Falk

Paul K. McKneely 17 years ago

Hi,

Now that IS funny. This is the very thing that the programming community doesn't want. Don't forget, Arabic and Hebrew are read from right to left. Is the above code what an LR parser is for? Or should it be called an LR/RL parser? What I had in mind is more like π=3.1415926; The English-speaking world has used a lot of Greek letters for variables during the past few centuries. It wouldn't be much of a shock for programmers to suddenly be able to use π instead of pi.
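For comparison, some existing languages already accept this. The sketch below uses Python 3, chosen only for illustration (it is not the language under discussion); Python allows non-ASCII letters in identifiers (PEP 3131). The function name `circle_area` is made up.

```python
import math

# A Greek letter used directly as a variable name.
π = 3.1415926

def circle_area(radius):
    """Area of a circle, written with the single-letter constant."""
    return π * radius ** 2

area = circle_area(2.0)
print(area)

# The hand-typed constant agrees with math.pi to about seven digits.
close_enough = abs(π - math.pi) < 1e-6
```

Whether this is convenient in practice depends on how easily the letter can be typed, which is a separate question from whether the language permits it.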

Paul

Paul K. McKneely 17 years ago

Thank you for being so polite and humble. Let me say that the new language is not about internationalization. It is about providing a much more powerful programming environment than is available with standard languages. (I expect to get a lot of flames for that last statement. I understand that there are a lot of insecure people in the world who will feel outrage at just about anything I have to say. Such is the price for a small amount of useful feedback.)

Nobody in their right mind would try to write an operating system (or a device driver!) in Java. With no pointers and only signed integers, it would be like programming with a straitjacket on. And what would happen when an interrupt happened and the Java engine decided it was time for garbage collection in the middle of an interrupt service routine?

Let me answer your question with your own words:

The output of the software development tool chain is for programmers only. I don't think everyone else will care if the ordinal rules don't conform to every village on the planet.

I didn't propose anything. I asked for input. And your comments are well taken but do not address my request.

Sort of like the way you started to insult me with your first remarks? I can see that. I'll try not to follow your lead and I might be okay. I worked with a Mexican-American one time who did give me useful feedback along with a funny story. The ?/? are in the character subset.

Paul

Paul Keinanen 17 years ago

That is one good usage for an extended character set that I would have needed several times.

However, I do not understand the need to invent yet another single-byte character encoding. Why not simply use Unicode with UTF-8 encoding and, if necessary, restrict it to a suitable subset, such as MES-2 or WGL-4, to simplify editing on various platforms (availability of fonts etc.)?
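A restricted repertoire on top of UTF-8 could be checked roughly as follows. This is a sketch under loud assumptions: `ALLOWED_BLOCKS` is a hypothetical list of Unicode block ranges, not the real MES-2 or WGL-4 tables (those are defined by their own specifications), and the identifier rule is invented for the example.

```python
# Hypothetical repertoire: a few Unicode blocks, NOT the actual MES-2/WGL-4 contents.
ALLOWED_BLOCKS = [
    (0x0041, 0x007A),  # Basic Latin letters (non-letters are filtered by isalpha)
    (0x00C0, 0x024F),  # Latin-1 Supplement letters, Latin Extended-A and -B
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x0400, 0x04FF),  # Cyrillic
]

def in_repertoire(ch):
    """True if ch is a letter inside one of the allowed blocks."""
    code = ord(ch)
    return ch.isalpha() and any(low <= code <= high for low, high in ALLOWED_BLOCKS)

def is_valid_identifier(name):
    """A repertoire letter first, then repertoire letters, ASCII digits or '_'."""
    if not name or not in_repertoire(name[0]):
        return False
    return all(in_repertoire(ch) or ch in "0123456789_" for ch in name[1:])

for candidate in ["π", "größe", "длина", "変数", "2x"]:
    print(candidate, is_valid_identifier(candidate))
```

The source file itself stays ordinary UTF-8, so any Unicode-aware editor can open it; only the compiler's identifier rule is narrowed.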

Paul

Frank Buss 17 years ago

I think Boudewijn just wanted to show you a language that already has the feature you want, so maybe it would be helpful to take a look at it when designing your new language.

Looks like there is already such a system:

formatting link

I don't know it in detail, but looks like they can use hardware resources in an object oriented and safe way:

formatting link

And Microsoft has a research project, which uses a virtual machine for implementing an OS:

formatting link

Frank Buss, fb@frank-buss.de http://www.frank-buss.de, http://www.it4-systems.de

Jack 17 years ago

It depends on what is faster. Even if you have the glyph of pi in the character set, if the keyboard doesn't offer easy access to it, in my opinion it will not be used. Writing "pi" is two keys; writing the glyph on a standard keyboard is at least three keys (for example ctrl-shift-p). The two-key combinations are already used (shift keys for capital letters, ctrl and alt keys for program functions). So I think it is faster to continue to use the transcription.

And for the moment we are speaking about Western alphabets (Latin, Greek, ...), but what about Asian alphabets? Do you want to encode them too? I would like to see a program with variable names written in Chinese ideograms where, if I remember well, the meaning sometimes depends on which ideogram is next to the one you are reading (or writing).

I think Mr. Dijkstra is right. For program code, use the least complex character set you can find (i.e. ASCII); for comments, variable and function names and so on, English should be the language to use. For strings, use Unicode.

Bye Jack

David Brown 17 years ago

There are times when non-English identifiers such as pi, or the Greek lower-case letters, could be useful. But they are few and far between, mostly restricted to mathematical programming. And if you want to be able to write pi as a single Greek letter identifier, you also want to be able to write identifiers with subscripts, and very soon a wide range of proper mathematical notation. This would lead to chaos very quickly (and it's already been done - it's called APL).

Allow non-ASCII characters in comments and strings. Your choices are to either fix on Latin-1, fix on UTF-8, or allow different encodings with an identifier at the start of the file. I'd go for UTF-8 as a modern choice that works well and allows a very wide range of characters, while working with a great range of existing tools.
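A few properties of UTF-8 explain why it works with existing tools. The following Python sketch only demonstrates well-known properties of the encoding; the sample strings and the helper name `hex_bytes` are arbitrary.

```python
def hex_bytes(text):
    """Show the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in text.encode("utf-8"))

for sample in ["pi", "π", "ä", "Ж"]:
    print(f"{sample}: {hex_bytes(sample)}")

# ASCII-only text is unchanged, so existing ASCII tools still read it.
ascii_unchanged = "pi".encode("utf-8") == b"pi"

# Every byte of a non-ASCII character is >= 0x80, so it can never be
# mistaken for an ASCII delimiter such as a quote, a slash or a newline.
high_bytes_only = all(byte >= 0x80 for byte in "π".encode("utf-8"))

# UTF-8 is a plain byte sequence, so endianness never comes up, and sorting
# the raw bytes gives the same order as sorting by code point.
words = ["zebra", "äpfel", "apfel", "πλάτος"]
same_order = sorted(words) == sorted(words, key=lambda w: w.encode("utf-8"))
```

So byte-order independence and a stable binary sort order, two of the goals named at the start of the thread, already come with UTF-8; language-specific alphabetical order still needs a collation step on top.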

Trying to invent your own character set, encodings, and orderings is about as useful to your users as using Esperanto for the documentation "in order to keep it international". It's a sure way to guarantee that your project will fail.

Falk Willberg 17 years ago

Paul K. McKneely wrote:

The line your newsreader messed up is a comment, as "//" is at the *start* of the line ;-)

In a world where even news clients are unable to declare the character set that was used, everything would read ??=3?14... or US?=?*1?3... (3.1415 would be written 3,1415 in German)

Falk

Boudewijn Dijkstra 17 years ago

On Wed, 28 Jan 2009 01:12:00 +0100, Paul K. McKneely wrote:

Don't be so bitter. You were the one trying to cater to Europeans (exclusively) without having a clue about the current situation or the wants and needs of those Europeans.

How is "it would be great if European programmers could [write] in their own native languages" not about internationalization?

OK, probably a praiseworthy goal. But what does that have to do with Europeans in particular and their languages in general? And maybe it would be wise to outline the shortcomings of the "standard languages" so that people can better flame you.

I said "like Java", we were talking about the character set of the language, not about the available types and other grammatical elements.

Read JSR-1.

formatting link

So what's the point of using an extended character set if you agree to be using English anyway?

Why sort users' results differently from programmers' results? It can only be confusing and annoying.

Besides asking for input, you apparently created a 'proposed' character set (which I still haven't seen).

As I have read it, your request was about an extended character set to be used by Europeans to program (partly) in their native language. If this is not so, then please re-phrase your request as I have mis-interpreted your general direction with this 'project'.

That was not inadvertent. I am European; get used to it. ;)

I am definitely not leading, merely interrogating and questioning your direction and your reasoning.

Paul K. McKneely 17 years ago

Hi Falk,

I actually read the Arabic correctly. It was somewhere in the reply step where it was changed. Thank you for being polite. Boudewijn Dijkstra is wrong in his implication that all Europeans are rude jerks. I remember someone saying one time: "Be polite and considerate. You never know who might end up being your boss."

Paul

Paul Keinanen 17 years ago

OTOH, if the program refers for instance to an external record dealing with some purely national entities (such as defined by the national legislation), should the programmer invent some unofficial English translation for these entities or use the name without accented characters ?

However, at least in Finnish, doing the ä=>a and ö=>o translation might result in another word with a completely different meaning. In the worst case, two identifiers in the same record might end up with the same US-ASCII representation.

IMHO, as a former Fortran programmer, 6-bit characters and 6-character identifiers should be enough :-) :-)
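The collision risk is easy to demonstrate. The sketch below is a hypothetical check, not part of any real toolchain: it drops the dots from each field name and reports names that become identical. The sample pairs are ordinary Finnish words ("säde" ray vs. "sade" rain, "tähti" star vs. "tahti" pace); the record itself is invented.

```python
from collections import defaultdict

# Dot-dropping transliteration as described above (only ä and ö are handled).
DROP_DOTS = str.maketrans({"ä": "a", "ö": "o", "Ä": "A", "Ö": "O"})

def to_ascii(identifier):
    return identifier.translate(DROP_DOTS)

def find_collisions(identifiers):
    """Group names by their ASCII form and keep groups with more than one member."""
    groups = defaultdict(list)
    for name in identifiers:
        groups[to_ascii(name)].append(name)
    return {ascii_name: names for ascii_name, names in groups.items() if len(names) > 1}

record_fields = ["säde", "sade", "tähti", "tahti", "nimi"]  # made-up record
collisions = find_collisions(record_fields)
print(collisions)
```

A check like this could run before names are transliterated, so that clashing fields are renamed by hand instead of silently merged.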

Paul

Stefan Reuther 17 years ago

Come on, this is usenet, not kindergarten. And if you're trying to revolutionize the world, you should be able to endure a little sarcasm.

Actually, your post reminded me of something. When I was 15, I tried to revolutionize the world with a new programming system as well. You started with a character set - I started with an object file format. So I defined the object file format that "would be able to store code for all processors on the planet", without having ever seen anything other than a Z80 and an x86. Far call patching? Alignment? Link-time inlining? What's that? But surely everyone has segment registers.

This is similar with your character set. Almost nobody really needs identifiers in native language, because that messes up interoperability. If I have a printed manual saying I should call function ?, how would I do that if I don't find it on my keyboard? And a character set that collates nicely (outside A-Z) is rarely of use, too. When I sort things, I either don't care how exactly it is sorted, I just want to be able to find things with a binary search. Or I want a locale-specific, case-blind sort, which, as I've shown, can differ widely depending on the actual locale used.
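The first case, where any consistent order will do, can be sketched with Python's standard `bisect` module. The names and the helper `contains` are illustrative only. The table is sorted by plain code point, which is all a binary search needs, whatever the encoding.

```python
import bisect

def contains(sorted_items, item):
    """Binary search: works for any total order, alphabetical or not."""
    index = bisect.bisect_left(sorted_items, item)
    return index < len(sorted_items) and sorted_items[index] == item

names = ["Zander", "Ärmel", "Adler", "Afrika"]

# Code point order puts 'Ärmel' after 'Zander'. That is wrong for a printed
# index but perfectly fine for lookups.
lookup_table = sorted(names)

found = contains(lookup_table, "Ärmel")
missing = contains(lookup_table, "Armel")
print(lookup_table, found, missing)
```

For anything shown to a user, the table would be re-sorted with a locale-specific key; the encoding's numeric order helps with neither case.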

Long ago, for a semester project, we tried to use a coding style using our native language. We set the rule: all code we write has to be in German, so we can more easily tell what is our code and what is code imported from the runtime library and the framework. This taught us two things: (1) mixed-language code looks ugly, because half of the accessor functions are called get/set, and the other half is gib/setze. (2) umlauts break tools. Javadoc refused to generate an index containing umlauts, and all the code metric tools our teacher tried to use crashed and burned on the code. I ultimately hacked up a perl script to get the metrics.

Stefan

Stefan Reuther 17 years ago

Funny that even today's software cannot post a correct π.

Mathematicians use Greek letters and funny fonts because they don't have multi-character identifiers. When they write "?", they usually mean "angle". I actually consider it an advantage to be able to write "angle" in my programs.

π might be an exception because it's so prominently known, but it's nothing I would design my language around. Especially in embedded/DSP contexts, trig functions often have a period of, say, 64, not 2π :-)

Stefan

Falk Willberg 17 years ago

Paul Keinanen wrote: ...

How do you substitute ä/ö/ü? Germans write ae/oe/ue instead.

But Finnish is a good example. Most European languages are successors of Latin or heavily influenced by Latin. So it is possible to understand comments in e.g. Italian.

I am working on C-code, which is partially commented in Finnish. I can't understand any word. Luckily all objects are named in English and

formatting link

translates to Finnish.

IMO, as a former BASIC programmer, 6 characters should be the minimum for any instance that is valid over more than two lines :-)

Code should be written in English. Comments, if possible, too.

Falk

Paul Keinanen 17 years ago

I have never seen such transliterations in any Finnish programs; we just drop the dots.

Finnish is a Uralic language (such as Estonian and Hungarian).

Those are Indo-European languages.

My comment might be a bit outdated, but from the job security point of view in the 1990's, using national identifier names is a good idea:-).

Paul
