Regexes and C


mark lewis 6 years ago

Re: Re: Regexes and C By: Alister to Martin Gregorie on Thu Mar 19 2020 22:25:52

i suspect the OP is attempting to mitigate command injections similar to those spoken of in this article...

formatting link

or

formatting link

if wordwrapping breaks the >70someodd character line width somewhere along the path...

)\/(ark


Martin Gregorie 6 years ago

I spent more time than I should have yesterday trying to understand regcomp(), regexec() and regerror() well enough to validate a string containing an e-mail address, i.e. to make sure that its structure is correct and that neither the username nor the domain contains characters they shouldn't.

The upshot was that I couldn't do it: I could not write a regex that would detect spaces in the address because apparently regcomp doesn't provide any way to anchor a regex to either end of a string. So I ended up with a negated regex that detects invalid characters in the string and hasn't a clue whether it's syntactically correct:

[^.a-zA-Z0-9@_-]

This does the trick, but no thanks to the man pages regex(3), which describes the C functions, and regex(7), which describes the regex syntax. Both are poorly formatted, hard to read, and seem to have omitted useful information, such as the inability to specify anchor points in strings that DO NOT contain newlines.
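For reference, a minimal sketch of how that negated bracket expression might be wired up with regcomp() and regexec(). The function name has_forbidden_char() is made up for illustration, and the choice of REG_EXTENDED | REG_NOSUB is an assumption, since the post does not say which flags were used.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if s contains any character
 * outside the permitted set [.a-zA-Z0-9@_-]. */
bool has_forbidden_char(const char *s)
{
    regex_t re;
    bool found;

    /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
    if (regcomp(&re, "[^.a-zA-Z0-9@_-]", REG_EXTENDED | REG_NOSUB) != 0)
        return true;    /* treat a compile failure as "reject" */

    /* regexec() returns 0 when the pattern matches somewhere in s. */
    found = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

Note that this only says whether an unwanted character is present. It says nothing about structure, so a string such as "@@" passes.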

So, can any of you do better, i.e. write a regex that CAN validate the syntax of an e-mail address in terms of its structure and the set of permitted characters in the username and domain parts (the permitted character sets are not the same)?

Also, if anybody can suggest a better tutorial on using these functions or suggest another, better, set of C functions for doing the same job, that would be wonderful.

PS: I did check my old reliable standby text - David Curry's "UNIX Systems Programming for SVR4", but it wasn't helpful in this case because, unusually, the set of functions in the C Standard Library have changed both names and parameters since it was written.

Martin | martin at Gregorie | gregorie dot org


A. Dumas 6 years ago

More or less impossible. E.g. apparently you didn't think that + is a valid character, which it is (in the part before the @). Also, domains (and usernames) can be UTF8. Best way is: try to deliver, check reply.


Martin Gregorie 6 years ago

The sources I consulted said the only permitted nonalphanumerics in the usernames are period, hyphen and underscore, just as the only nonalphanumeric in the domain is the period.

Fair point - I should have said that I want to use this as a filter to prevent cross-site scripting attacks, i.e. to prevent the From address being used as an attack vector.

Another annoyance with regcomp/regexec is that the common :alnum: abbreviation is *only* recognised if it occupies the whole set of alternates, i.e. [:alnum:] works, [.:alnum:_-] doesn't.

All in all this looks like something that would be better done without using C regexes. IOW, either as a rather messy string comparison game or in Java using its pattern matching classes.

Martin | martin at Gregorie | gregorie dot org


Theo 6 years ago

It's extremely hard, but some people have tried:

formatting link

Definitely not a thing to make up yourself, you'll almost certainly get it wrong.

Theo


Roger Bell_West 6 years ago

No; email addresses cannot be syntactically validated by regexp alone.

R


DeepCore 6 years ago

On 19.03.2020 at 14:29, A. Dumas wrote:

We are using a small CRM that checks whether an MX record exists in the DNS for the domain part.

So, first check if domain is valid for e-mail, then try to deliver and check response ...


Martin Gregorie 6 years ago

Yep, so it seems - and anyway the examples on your second link are very unlikely to be accepted by regcomp(), so I'll have a play with Java's pattern matching classes. IIRC they sidestep the UTF8 problem anyway.

Thanks for that.

Martin | martin at Gregorie | gregorie dot org


Dan Cross 6 years ago

It's not clear to me that the full syntax of email addresses can be represented in the regular languages. Undoubtedly a useful subset _can_, but in their full generality, you may need a push-down automaton.

What do you mean "doesn't provide any way to anchor a regex to either end of a string"? That's what the `^` and `$` metacharacters in the regex are for, and they're fully supported by the library.

Could you clarify what you mean? '$' will match the empty string at the end of a line, '^' matches the empty string at the beginning of a line. By default, the library ignores newlines entirely; they're only significant if you use the `REG_NEWLINE` flag to `regcomp()`.
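A small sketch of the behaviour described above, for anyone who wants to try it. The helper name matches() and the test patterns are invented for illustration; only regcomp(), regexec(), regfree() and the REG_* flags come from the POSIX interface.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: compile pattern as an extended regex with any
 * extra flags, and report whether it matches s. Returns false if the
 * pattern fails to compile. */
bool matches(const char *pattern, const char *s, int extra_flags)
{
    regex_t re;
    bool ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | extra_flags) != 0)
        return false;
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With this helper, matches("^[a-z]+$", "abc", 0) should succeed and matches("^[a-z]+$", "ab c", 0) should fail, because the anchors pin the match to both ends of the string. For the string "abc\ndef" the same pattern should fail with no extra flags (the newline is just another character that [a-z] does not match) and succeed with REG_NEWLINE, where '^' and '$' also match around the embedded newline.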

Perhaps if you could post your code, one might be able to see an issue?

As far as other libraries, if you can link against C++ code, the RE2 library is very nice.

You'd want something that covers the POSIX interfaces.

- Dan C.


Martin Gregorie 6 years ago

Good idea, but not needed here because I only need to check the From address on incoming mail.

Martin | martin at Gregorie | gregorie dot org


Martin Gregorie 6 years ago

OK, I'm starting to see that, so it looks like my current strategy of inverting a bracket expression containing all the characters that can legitimately be in an e-mail address is about as far as I can go.

Doing this in either C or Java should be OK, since I'm only looking to stop From: headers being used as attack vectors on a bash script. AFAICR Bash only accepts ASCII, so any message whose From: address contains anything that isn't ASCII alphanumeric, '@', hyphen, underscore or period can be binned.
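The same whitelist can also be checked without any regex. A minimal sketch using strspn() from the standard library follows; the function name only_permitted_chars() is made up for illustration, and rejecting the empty string is an extra assumption that is not stated in the post.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true only if s is non-empty and consists solely
 * of ASCII alphanumerics, '@', hyphen, underscore and period. */
bool only_permitted_chars(const char *s)
{
    static const char permitted[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789@-_.";

    /* strspn() returns the length of the leading run of permitted
     * characters; the string is clean only if that run reaches the
     * terminating NUL. */
    return s[0] != '\0' && s[strspn(s, permitted)] == '\0';
}
```

Anything containing a space, '$', ';', a backtick or a non-ASCII byte is rejected. So is any address containing '+', which is the kind of valid mail this filter will drop.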

Martin | martin at Gregorie | gregorie dot org


Andreas Karrer 6 years ago

DeepCore :

The existence of an MX record is often a good idea but by no means a requirement. An A or CNAME record is perfectly OK.

- Andi


Roger Bell_West 6 years ago

You will be dropping valid mail if you do this.

We can start with + in the address, but really, we can play this game all day.


Ahem A Rivet's Shot 6 years ago

The perl book on regular expressions includes one that validates email addresses - it is extremely long (more than a page IIRC); they are astonishingly difficult things to validate.

There's this site

formatting link

which has what purports to be a good regex for the job:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

You might need pcre (the perl compatible regex library).

Steve O'Hara-Smith           | Directable Mirror Arrays
C:\>WIN                      | A better way to focus the sun
The computer obeys and wins. | licences available see
You lose and Bill collects.  | http://www.sohara.org/


Heap O'trouble 6 years ago

[snip]

1) To anchor a regex, use the '^' and '$' metacharacters. '^' matches the empty string at the start of a line, and '$' matches the empty string at the end of the line.

2) There is no regex that can validate email addresses with 100% certainty. You /can/ write a regex that will come close, but there will be valid outliers that your regex will call invalid.

3) The RFCs describe exactly what an email address can consist of. You want to study, at least, RFC 5322 section 3.4

formatting link

HTH

Lew Pitcher "In Skills, We Trust"


Mike Scott 6 years ago

On 19/03/2020 14:12, Martin Gregorie wrote: ....

IMBW, but these do indeed /only/ work inside a [...] construct, so shouldn't your examples read [[:alnum:]] and [.[:alnum:]_-] ?

(see eg

formatting link

under 'POSIX character classes'. Also eg

formatting link

)
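A quick sketch for checking the bracket-expression point with regcomp(). The helper name matches_ere() and the sample strings are invented for illustration. The comment about how the unbracketed form is parsed describes what common implementations such as glibc appear to do, and is an assumption rather than something guaranteed everywhere.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: does the extended regex `pattern` match s?
 * Returns false if the pattern does not compile. */
bool matches_ere(const char *pattern, const char *s)
{
    regex_t re;
    bool ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

/* "^[.[:alnum:]_-]+$" : period, any alphanumeric, underscore, hyphen.
 * "^[.:alnum:_-]+$"   : without the inner brackets this is (on common
 *                       implementations) just a list of the literal
 *                       characters . : a l n u m _ - */
```

So matches_ere("^[.[:alnum:]_-]+$", "a.b_c-9") should be true, while matches_ere("^[.:alnum:_-]+$", "b") should be false because 'b' is not among the literal characters listed.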

Mike Scott Harlow, England


Dennis Lee Bieber 6 years ago

On Thu, 19 Mar 2020 13:18:58 -0000 (UTC), Martin Gregorie declaimed the following:

formatting link

Wulfraed Dennis Lee Bieber AF6VN wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/


Martin Gregorie 6 years ago

Fair point - never thought of that!

Martin | martin at Gregorie | gregorie dot org


Martin Gregorie 6 years ago

Just that:

My original regex was

"[a-zA-Z0-9][.a-zA-Z0-9_-]*@[a-zA-Z0-9][a-zA-Z0-9.]*[a-zA-Z0-9]*"

and matched a string containing "a snipped-for-privacy@d.e", so I changed it to

"^[a-zA-Z0-9][.a-zA-Z0-9_-]*@[a-zA-Z0-9][a-zA-Z0-9.]*[a-zA-Z0-9]*$"

and it *still* matched that string. So I reread regex(7) and this time noticed:

'^' (matching the null string at the beginning of a line), '$' (matching the null string at the end of a line)

Which, by its discussion of lines, seems to imply that regcomp/regexec thinks strings, i.e. shell parameters, are somehow different from strings that have been filled by reading lines from a file.
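One way to test this in isolation is sketched below, using the two patterns quoted above. The function name regex_check() is made up, and the test string "a b@d.e" is a hypothetical stand-in, because the string in the post was redacted. regexec() only ever sees a NUL-terminated char array, so an argv element and a line read from a file are treated identically.

```c
#include <regex.h>
#include <stdio.h>

static const char UNANCHORED[] =
    "[a-zA-Z0-9][.a-zA-Z0-9_-]*@[a-zA-Z0-9][a-zA-Z0-9.]*[a-zA-Z0-9]*";
static const char ANCHORED[] =
    "^[a-zA-Z0-9][.a-zA-Z0-9_-]*@[a-zA-Z0-9][a-zA-Z0-9.]*[a-zA-Z0-9]*$";

/* Hypothetical helper: returns 1 on match, 0 on no match, -1 on error.
 * Compile errors are reported through regerror(). */
int regex_check(const char *pattern, const char *s)
{
    regex_t re;
    char msg[128];
    int rc;

    rc = regcomp(&re, pattern, REG_NOSUB);  /* basic regex syntax */
    if (rc != 0) {
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp: %s\n", msg);
        return -1;
    }
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```

On a conforming implementation, regex_check(UNANCHORED, "a b@d.e") should return 1 (it finds "b@d.e" inside the string) and regex_check(ANCHORED, "a b@d.e") should return 0. If the anchored pattern still appears to match in a real program, the calling code is worth a look: how the argument was quoted on the command line, which argv element is tested, and how the regexec() return value is interpreted.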

Exactly so. But they don't match the ends of a string that was passed in as a command-line parameter.

I tried getting into C++ years ago when it first became common (think Borland C++) and hated it, found Bjarne Stroustrup's C++ book far below the standard set by K&R and finally gave it up when I found all too much C++ code was in fact just ANSI C with // comment delimiters.

Java beats the crap out of it, IMO anyway.

Quite possibly, though I'm constantly surprised by how useful and relevant it still is. This is about the first time it hasn't come up with the goods, though that says at least as much about how stable the C standard library's APIs are.

Would you care to recommend a POSIX book that's as good as the SVR4 one was in its time?

Martin | martin at Gregorie | gregorie dot org


The Natural Philosopher 6 years ago

No. That's why I don't bother with regex, ever.

It's far faster for me to write a series of tests in 'C' than try and work out what random gobbledygook will do the job in regex.
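As a rough sketch of what such a series of tests might look like, here is a hand-written check using the restrictive rules from earlier in the thread: exactly one '@', a local part of ASCII alphanumerics plus period, underscore and hyphen, and a domain of ASCII alphanumerics plus period. The names is_ascii_alnum() and plausible_address() are made up for illustration. These rules are far narrower than RFC 5322, so valid addresses (anything with '+', for instance) are rejected.

```c
#include <stdbool.h>
#include <string.h>

/* Locale-independent ASCII test (assumes an ASCII-compatible charset). */
static bool is_ascii_alnum(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* Hypothetical helper implementing the restrictive rules above. */
bool plausible_address(const char *s)
{
    const char *at = strchr(s, '@');
    const char *p;

    /* Need exactly one '@', with something on both sides of it. */
    if (at == NULL || at == s || at[1] == '\0')
        return false;
    if (strchr(at + 1, '@') != NULL)
        return false;

    /* Local part: alphanumeric first, then alphanumerics . _ - */
    if (!is_ascii_alnum(s[0]))
        return false;
    for (p = s + 1; p < at; p++)
        if (!is_ascii_alnum(*p) && *p != '.' && *p != '_' && *p != '-')
            return false;

    /* Domain: alphanumeric first and last, alphanumerics and '.' between. */
    if (!is_ascii_alnum(at[1]))
        return false;
    for (p = at + 2; *p != '\0'; p++)
        if (!is_ascii_alnum(*p) && *p != '.')
            return false;
    return is_ascii_alnum(p[-1]);
}
```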

Regex is for nerds to impress other people with. It's not a smart way to program.

Same as SQL. By the time you have taken a day to write the SQL query that does everything you want, only to realise it takes 50 minutes to complete, you could have written most of it in C and got it down to 3 seconds...

"Those who can make you believe absurdities, can make you commit atrocities." M. de Voltaire
