Regexes and C

Doesn't work with gmail and other big sites. They accept the mail then bounce it back later. DAMHIKT

--
  "A leader is best When people barely know he exists. Of a good leader,
who talks little, When his work is done, his aim fulfilled, They will say,
Reply to
The Natural Philosopher

I haven't done regex in C recently, but for me the trick has always been getting the regular expression right. So I use an online tool to build and test regular expressions and see how they work.

I can't remember which one I used last, but something like this.

Much quicker than testing regex in your own code.

Reply to
Pancho

Wow, that's garbage.

Just go with: it has at least one @ and is not longer than [some number which is smaller than what you expect to be the minimum buffer size, e.g. 250]
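A minimal sketch of that check in C, for anyone who wants it spelled out. The function name plausible_address and the max_len parameter are made up for illustration; the 250 limit is only the example figure from the post.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if addr contains at least one '@'
 * and is between 1 and max_len bytes long, otherwise 0.
 * It makes no attempt to check the address against any RFC grammar. */
int plausible_address(const char *addr, size_t max_len)
{
    if (addr == NULL)
        return 0;

    size_t len = strlen(addr);
    if (len == 0 || len > max_len)
        return 0;

    return strchr(addr, '@') != NULL;
}
```

Calling plausible_address(addr, 250) before copying addr into a fixed-size buffer covers both halves of the suggestion.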
Reply to
A. Dumas

Hmm. Not on my system:

    : gaja; cat re.c
    #include <regex.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>

    const char *RE = "^[a-zA-Z0-9][.a-zA-Z0-9_-]*@[a-zA-Z0-9][a-zA-Z0-9.]*[a-zA-Z0-9]*$";

    int
    main(int argc, char *argv[])
    {
            regex_t re;

            int err = regcomp(&re, RE, 0);
            if (err != 0) {
                    char errbuf[128];
                    regerror(err, NULL, errbuf, sizeof(errbuf));
                    fprintf(stderr, "regcomp failed: %s\n", errbuf);
                    return EXIT_FAILURE;
            }
            for (int i = 1; i < argc; i++)
                    if (regexec(&re, argv[i], 0, NULL, 0) == 0)
                            printf("The string %s matches\n", argv[i]);

            regfree(&re);

            return EXIT_SUCCESS;
    }
    : gaja; make re
    cc -O2 -pipe -o re re.c
    : gaja; ./re 'a snipped-for-privacy@d.e'
    : gaja; ./re 'snipped-for-privacy@d.e'
    The string snipped-for-privacy@d.e matches
    : gaja;

Note that 'a snipped-for-privacy@d.e' did NOT match.

My system includes this in regex(3), when discussing newlines:

REG_NEWLINE Compile for newline-sensitive matching. By default, newline is a completely ordinary character with no special meaning in either REs or strings. With this flag, `[^' bracket expressions and `.' never match newline, a `^' anchor matches the null string after any newline in the string in addition to its normal function, and the `$' anchor matches the null string before any newline in the string in addition to its normal function.

That is, newlines are ordinarily treated like any other character.
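To see the difference the flag makes, here is a small sketch built on the same regcomp()/regexec() calls. The helper name re_matches is an assumption for illustration, not part of regex(3).

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles pattern with the given cflags and
 * tests it against subject.
 * Returns 1 on match, 0 on no match, -1 if the pattern fails to compile. */
int re_matches(const char *pattern, int cflags, const char *subject)
{
    regex_t re;

    /* REG_NOSUB: only a yes/no answer is wanted, no match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With the subject "a\nb", the pattern "^b$" should fail by default and succeed with REG_NEWLINE, because only then does ^ match after the newline. The pattern "a.b" goes the other way: it matches by default, since . matches the newline, and fails with REG_NEWLINE.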

Are you sure you're matching against the string you think you are? In particular, are you sure the string your program is matching against actually contains a space?

You don't have to program in C++ to use RE2. You just need to be able to link against a library that is written in C++.

I think the latest edition of "Advanced Programming in the UNIX Environment" is quite good. It has been kept up to date since the unfortunately premature death of W. Richard Stevens. I don't recall whether it covers regular expressions, though.

It's been many years since I have used a book for that kind of thing, so I'm afraid my recommendations for specific texts are dated. :-(

- Dan C.

Reply to
Dan Cross

I won't argue with that.

Here I DO disagree.

SQL is fine unless you insist on writing huge, do-everything queries. I was involved with one of them years ago as part of a benchmarking exercise and that pretty much put me off writing that sort of thing for good. Never tried SQL procedures either, but concise SQL queries used judiciously within logic written in C or Java work very well and are easy enough to write and maintain.

Most SQL performance problems, IME anyway, boil down to crap database design, meaning bad or nonexistent normalisation and incorrectly placed or missing indexes. But, given that a relational database has a decent, user-friendly query analyser and there's enough realistic test data, it's generally quite simple to get the speed up to where it should be.

Of course, if the DBA can't normalise and doesn't understand an ERD, and if the system designers only provide small amounts of largely imagined data rather than a few hundred or thousand actual business data items, then OF COURSE the database performance will be crap.

Don't ask me how I know that: I've been called on too many times to fix that sort of mess. But sometimes the clients got it right. In one project it was very nice indeed to be given half a million records of valid test data. That was for the last major DB I worked on: I did much of the design and then tuned it using that huge pile of actual data. It really sang from the off.

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

Yes, I agree.

I've used that one for PCRE regexes, but it's often just as easy to test them using grep with the -P option set. I use PCRE a lot more than other flavours because I have SpamAssassin installed and maintain a private rule set.

For Java regex testing I've also used this:

formatting link

However, is there a similar test harness for regcomp(), regexec() and friends? And a readable online document for that regex flavour?
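Failing an online tool, a few lines of C make a serviceable harness. The sketch below is only that: re_find and its parameters are invented names, and it reports the offsets of the overall match so the basic (cflags 0) and REG_EXTENDED flavours can be compared on the same pattern.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical harness: compiles pattern with cflags (do not pass
 * REG_NOSUB), searches subject, and stores the byte offsets of the
 * overall match in *start and *end.
 * Returns 1 on match, 0 on no match, -1 if the pattern does not compile. */
int re_find(const char *pattern, int cflags, const char *subject,
            int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];

    int err = regcomp(&re, pattern, cflags);
    if (err != 0) {
        char errbuf[128];
        regerror(err, &re, errbuf, sizeof errbuf);
        fprintf(stderr, "regcomp failed: %s\n", errbuf);
        return -1;
    }

    err = regexec(&re, subject, 1, m, 0);
    regfree(&re);
    if (err != 0)
        return 0;

    *start = (int)m[0].rm_so;
    *end = (int)m[0].rm_eo;
    return 1;
}
```

One thing this makes visible: "a+" compiled with REG_EXTENDED finds the run of a's in "xaaay", while the same pattern compiled with cflags 0 treats + as a literal character and only matches a string such as "xa+y".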

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

Stop trusting those sources; they don't know what they're talking about. Use RFC5321 and RFC5322 instead.

That's fundamentally the wrong approach. Instead, use an appropriate quoting/escaping scheme. See

formatting link
for many examples.

--
https://www.greenend.org.uk/rjk/
Reply to
Richard Kettlewell

Email a random number to the address. Make the punter come back and type that number in. Then, and only then, do you know the email is valid (and belongs, in some sense, to the punter wanting your wares...)
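A rough sketch of the number-generating half in C, under some loud assumptions: the function names are invented, the six-digit length is arbitrary, /dev/urandom is assumed to exist, and the small modulo bias is ignored. Actually sending the mail is left out.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: fills out[7] with six decimal digits taken from
 * /dev/urandom. Returns 0 on success, -1 if the random source fails. */
int make_confirmation_code(char out[7])
{
    uint32_t r;
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return -1;

    size_t n = fread(&r, sizeof r, 1, f);
    fclose(f);
    if (n != 1)
        return -1;

    /* r % 1000000 is always below 1000000, so six digits suffice. */
    snprintf(out, 7, "%06u", (unsigned)(r % 1000000u));
    return 0;
}

/* Returns 1 only if the punter typed back exactly the code we sent. */
int code_matches(const char *expected, const char *typed)
{
    return strlen(typed) == 6 && strcmp(expected, typed) == 0;
}
```

The point of the design is that nothing here inspects the address at all: delivery plus the round trip is the validation.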

(Yes, someone could be MITMing your email connection, but then you have bigger problems!)

--
Ian 

"Tamahome!!!" - "Miaka!!!"
Reply to
Ian

Interesting stuff, but it's all HTML and JS-related: nothing much there I can use outside that environment.

I'm dealing with bog-standard e-mails, which can have been sent from almost any hardware using almost any software and, at the immediate point of interest, are being passed between processes written in Python, C and bash. My immediate concern is to sanitise sender addresses being passed through a bash script, which is the only piece of the puzzle written by myself apart, of course, from the sanitiser.

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

In this case regex is not the problem, the problem is that email addresses are not designed to be parsed.

They're a tool like any other: useful when they help, not so much when they get in the way.

They have their uses - dismissing a powerful tool is not a smart way to program either.

Complex SQL queries are often a mistake - but at least SQL is reasonably consistent for the simple stuff.

--
Steve O'Hara-Smith                          |   Directable Mirror Arrays 
C:\>WIN                                     | A better way to focus the sun 
Reply to
Ahem A Rivet's Shot

One more in a mature system - feature creep. The database was fine for the original spec but nobody optimised for the new queries and tables that got added for new features and worked fine testing the new features against live data - it's just a pity what it did to the performance under load once they got heavily used.

--
Steve O'Hara-Smith                          |   Directable Mirror Arrays 
C:\>WIN                                     | A better way to focus the sun 
Reply to
Ahem A Rivet's Shot

It's doable, but hard. My email address for this post is real and works. But I selected it because it looks wrong to bad email address sanitizers. Many years ago I used "#@..." instead, but I found out it was breaking UUCP because the addresses were being passed unescaped to sh, which saw it as a comment. This was late 1990s, long after UUCP had mostly gone away. My goal with the address is to thwart spammers, not legit users, so I switched to the current style. Globbing rules mean that it would only be risky with a file that matches the domain in the working directory, which is fortunately very unlikely.

For proof of concept I created a address once, again real and deliverable. It was live for a year or so. In general, the local part of an email address has very few hard and fast rules. You'll find more rules in what commercial email providers are willing to let you send as than in what the software needs to support.

You are best off if you can avoid ever letting email addresses be interpreted by the shell. Stick them in "files" (streams and pipes count as files) and have mail programs parse them out of the stream.
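If the caller is a C program, one way to keep the shell out of it entirely is to fork and exec with an explicit argument vector instead of using system() or popen(). This is only a sketch: run_without_shell is an invented name, and the address used to try it out is a made-up example.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv[0] with argv as its argument vector,
 * with no shell involved, and copies up to outsz-1 bytes of the
 * child's stdout into out. Returns the child's exit status, or -1. */
int run_without_shell(char *const argv[], char *out, size_t outsz)
{
    int fds[2];
    if (outsz == 0 || pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: stdout goes to the pipe, then exec directly. */
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }

    /* Parent: collect the child's output. */
    close(fds[1]);
    size_t used = 0;
    ssize_t n;
    while (used < outsz - 1 &&
           (n = read(fds[0], out + used, outsz - 1 - used)) > 0)
        used += (size_t)n;
    out[used] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Passing something like "#*@example.org; echo hi" as one element of argv hands it to the child as a single untouched argument: no comment handling, no globbing, no command separator.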

It's probably mentioned else-thread, and I haven't seen it yet, but RFC821 (or is it 822?) comments in header lines are basically regexp-proof. The simple cases can be handled, but not the full complexity. The full complexity is basically only used by people being deliberately difficult, so you don't run into it often. The part you can't handle with regexp, at least in a single pass, is the rule requiring balanced parentheses for nested comments.
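The nesting is easy with a counter, which is exactly what a classic regexp lacks. A simplified sketch follows; comments_balanced is an invented name, and it handles only parentheses, backslash-quoted characters and double-quoted strings, not the full header grammar.

```c
/* Hypothetical helper: returns 1 if every '(' in s is closed by a
 * matching ')' at the right nesting depth, 0 otherwise.
 * A backslash quotes the next character; parentheses inside a
 * double-quoted string (outside any comment) are not comments. */
int comments_balanced(const char *s)
{
    int depth = 0;
    int in_quotes = 0;

    for (; *s != '\0'; s++) {
        if (*s == '\\') {
            if (s[1] == '\0')
                return 0;       /* dangling backslash */
            s++;                /* skip the quoted character */
            continue;
        }
        if (depth == 0 && *s == '"') {
            in_quotes = !in_quotes;
            continue;
        }
        if (in_quotes)
            continue;
        if (*s == '(') {
            depth++;
        } else if (*s == ')') {
            if (depth == 0)
                return 0;       /* close without an open */
            depth--;
        }
    }
    return depth == 0 && !in_quotes;
}
```

The depth variable is the part a single-pass regexp cannot express: it has to remember how many comments are still open.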

Elijah

------ years ago used to maintain the "+ addressing" FAQ

Reply to
Eli the Bearded

I've been reading this for some time now and I really do not understand this hassle.

I think I even read somewhere a few years ago, when studying all types of online registration and verification mechanisms, that a mail address was not meant to be checked on the client side for validity. Just let SMTP do this for you and handle the result. This was confirmed by experts in the field, AFAIR (18 million mail addresses and more than 200 domains), when I was involved in a DMARC rollout there.

In the case of internal/external addresses, let SMTP do this for you: use the configuration to handle all those issues.

I would rather accept the mail and let it be delivered. If it cannot be delivered, it will return anyway. Use your local SMTP to handle local domains and add rewrite rules there for the external domains. Anything else is dropped. It's so simple.

If you need something like those "create account" templates, it is usually handled by a verification link. Well, there are those sites that give you a temporary mail address, but they are usually filtered (by precedence the least).

And also what do you mean by sanitize?!

And last but not least: there is nothing greater than Perl for working with regex.

But as it was already mentioned things changed over the years - you can have UTF/unicode and most of the examples are not working, however I guess people took care of that already

formatting link

And if you need it fast you can embed

formatting link

Reply to
Deloptes

And why would you expect bogus messages to be using an invalid sender address? (Quite frankly, given the difficulty of validating an email address, actually generating an invalid one must be almost as difficult.) Sanitise the data you are actually processing. If it is the sender address that is being stored and processed elsewhere, then use a registration method that requires confirmation before it is accepted.
--
There are few people more often in the wrong than those who cannot endure 
to be thought so.
Reply to
Alister

Thanks for your example code: it clarified a couple of parameter definitions that were rather unclear in the manpage and that I therefore didn't understand. regerror() now works and so do the ^ and $ endpoint anchors.

My code was somewhat larger than an SSCCE due to other stuff I wanted it to do, which is why you didn't see it.

SSCCEs don't seem to be much known outside Java circles, so here's a reference. It's actually a very good idea that can be used with almost any computer language:

formatting link

Sounds useful. Note made to look at it.

Call me old-fashioned, but I find a good book a lot more usable for reference than any reasonable sized screen or e-book I've seen so far. I'm likely to hold that view until the standard E-book becomes rather more booklike, i.e. with two double-sided pages used for, from front to back:

Contents, two facing pages for the main content, and an index page.

This would let a reader find stuff fast because content and index pages wouldn't have to be scrolled to all the time. If the whole thing is thin and uses e-ink displays, so much the better.

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

Quite, but for a long time now I've been using getmail to retrieve mail from my ISP. It pops it into a pipeline, defined within a bash shell, which runs mail through SpamAssassin before passing it to a C program I wrote that quarantines any mail that got marked as spam before passing the rest to Postfix sendmail, which hands it to my Postfix server for delivery to local mailboxes.

I do it this way because it means that I have no open ports on my firewall.

Meanwhile, I've needed to change the getmail configuration, which involved talking to its author to clarify some points, and during this I got told I was a very bad boy for daring to do something so dangerous as feeding a bash-defined pipeline from getmail. The only thing that gets passed as a parameter is the From: address, so I thought I should at least make an attempt to make sure nothing bad can subvert this single parameter being passed. The pipeline has been running for years with no problems.

The only thing that's changed is that a new system I want to run will require getmail to be run with a higher privilege level than at present so it can deliver mail directly to the new system via a revised Destination configuration.

The old pipeline will remain unchanged apart from (maybe) adding some rat traps to passing the From address into it.

So, given that I now have two versions of the address filter running (one using a version of my original regex and the other doing an inverted match to reject any addresses with unwanted character values), do I really need to use either of them?
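For comparison, the inverted-match idea does not need a regex at all. A minimal sketch in C, assuming a deliberately conservative whitelist: SAFE_CHARS and only_safe_chars are invented for illustration, and the list will reject some addresses that are perfectly valid under the RFCs.

```c
#include <string.h>

/* Assumed whitelist of characters considered harmless as a shell
 * parameter. This is not the RFC 5322 grammar. */
static const char SAFE_CHARS[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "@.+_-";

/* Hypothetical helper: returns 1 if addr is non-empty, does not start
 * with '-' (so it cannot be mistaken for an option), and consists only
 * of characters from SAFE_CHARS; otherwise 0. */
int only_safe_chars(const char *addr)
{
    if (addr[0] == '\0' || addr[0] == '-')
        return 0;

    /* strspn() returns the length of the leading run of safe
     * characters; the address is clean if that run reaches the end. */
    return addr[strspn(addr, SAFE_CHARS)] == '\0';
}
```

Whether a rejected address is then dropped, quarantined or replaced with a placeholder is a separate policy decision.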

Thanks: I'll look at those tomorrow.

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

Yep, that would do it too, but it's really just another case of same old, same old, as either the original design culprits are still there, or they've moved on, leaving a largely undocumented system for the new guys to sort out. Or, of course, worse still, maintenance and enhancements have been outsourced.

My point being that it's often the same mistakes being warmed over again.

I'm lucky in that, back when IDMS was a thing, I got some really good training on its care and feeding. I largely picked up relational databases on the job, but a surprising amount of what I learnt about IDMS design was also relevant for RDBMS, especially the preliminaries: data normalisation and using Entity-Relationship diagrams to design the schema.

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

Exactly what I said. Don't use SQL to do complex stuff: it's very hard to get the syntax right and it runs like a dog with three legs amputated.

No. In the case where I did the biggest job (normalising a flat database of a few million UK postcodes into a relational one), none of these were the problem.

The problem was MySQL's inability to create good optimised machine code out of SQL statements. Unlike, say, modern C compilers, which astound me in their ability to write better assembler than I could myself, MySQL is like going back to the first 8-bit C compilers I used.

But, given that a relational database has a decent,

On simple queries, yes, but not on complex ones involving conditional selections of selections etc.

When I had finished, what I wanted ran well with over a million records, but it did not use complex queries.

Creating it from the data I started with would have needed them, if I hadn't given up trying to do the whole job with SQL and restricted myself to simple queries, building enormous linked lists in C (over a gigabyte in size) and thinking hard about how I would access the contents.

--
"People believe certain stories because everyone important tells them,  
and people tell those stories because everyone important believes them.  
Reply to
The Natural Philosopher

All too true.

Hmm similar, I cut my DBA teeth on a thing called MDBS-III, a network database engine that ran on CP/M, MP/M and MS-DOS[1]. Getting the schema right really mattered because changing it was a *pig*.

[1] It wasn't until the AT that the MS-DOS version was the fastest.
--
Steve O'Hara-Smith                          |   Directable Mirror Arrays 
C:\>WIN                                     | A better way to focus the sun 
Reply to
Ahem A Rivet's Shot

Reading your post tells me that you did not want to read the RFC, you did not want to commit to the RFC, you want to do it your way, and you have little to no understanding of how e-mail works.

Why bother all of these people? I mean, it is a public list. You can post whatever you want, but ... anyway. Do whatever you want, however you want.

You know it is 2020 and not 1990 :) You probably want to check imapsync, which I would be using if I were you. But even this I cannot understand.

So you have your mail server/domain somewhere else; why would you download the mail locally? What kind of ports do you need on the firewall? Why?

And programming regex in C for a custom self-imposed issue! This is simply too much for me :/ Why on the Pi list?

Reply to
Deloptes
