Regexes and C

I spent more time than I should have yesterday trying to understand
regcomp(), regexec() and regerror() well enough to validate a string
containing an e-mail address, to make sure that its structure is
correct and that neither the username nor the domain contains characters
they shouldn't.

The upshot was that I couldn't do it: I could not write a regex
that would detect spaces in the address because apparently regcomp
doesn't provide any way to anchor a regex to either end of a string. I
ended up with a negated regex that detects invalid characters in the
string and hasn't a clue whether it's syntactically correct:

[^.a-zA-Z0-9@_-]

This does the trick, but no thanks to the man pages regex(3), which
describes the C functions, and regex(7), which describes the regex syntax.
Both are poorly formatted, hard to read, and seem to have omitted useful
information, such as the inability to specify anchor points in strings
that DO NOT contain newlines.

So, can any of you do better, i.e. write a regex that CAN validate the
syntax of an e-mail address in terms of its structure and the set of
permitted characters in the username and domain parts (the permitted
character sets are not the same)?

Also, if anybody can suggest a better tutorial on using these functions
or suggest another, better, set of C functions for doing the same job,  
that would be wonderful.

PS: I did check my old reliable standby text - David Curry's "UNIX  
Systems Programming for SVR4", but it wasn't helpful in this case  
because, unusually, the set of functions in the C Standard Library have  
changed both names and parameters since it was written.
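
On the anchoring point: as POSIX specifies it, ^ and $ in a pattern compiled by regcomp() anchor to the start and end of the whole string unless REG_NEWLINE is set or REG_NOTBOL/REG_NOTEOL are passed to regexec(). A minimal sketch in C follows, assuming a deliberately narrow pattern that is nowhere near the full RFC 5321/5322 grammar. The function name looks_like_simple_address and the pattern itself are illustrative choices, not something taken from the man pages.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if addr matches the narrow,
 * illustrative pattern below. This is NOT an RFC 5321/5322 validator. */
bool looks_like_simple_address(const char *addr)
{
    /* ^ and $ anchor to the ends of the whole string because
     * REG_NEWLINE is not passed to regcomp(). */
    static const char pattern[] =
        "^[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+$";
    regex_t re;
    bool ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;              /* treat a compile failure as "no match" */
    ok = (regexec(&re, addr, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With this sketch, looks_like_simple_address("martin@example.org") is true, while an address containing a space or a trailing newline is rejected because the whole string has to match.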


--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C
On 19/03/2020 14:18, Martin Gregorie wrote:

More or less impossible. E.g. apparently you didn't think that + is a  
valid character, which it is (in the part before the @). Also, domains  
(and usernames) can be UTF8. Best way is: try to deliver, check reply.

Re: Regexes and C
On Thu, 19 Mar 2020 14:29:35 +0100, A. Dumas wrote:

The sources I consulted said the only permitted nonalphanumerics in the  
usernames are period, hyphen and underscore, just as the only  
nonalphanumeric in the domain is the period.

Fair point - I should have said that I want to use this as a filter to
prevent cross-site scripting attacks, i.e. to prevent the From address
being used as an attack vector.

Another annoyance with regcomp/regexec is that the common :alnum:  
abbreviation is *only* recognised if it occupies the whole set of  
alternates, i.e. [:alnum:] works, [.:alnum:_-] doesn't.

All in all this looks like something that would be better done without  
using C regexes. IOW, either as a rather messy string comparison game or  
in Java using its pattern matching classes.
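
The "string comparison game" need not be very messy. A minimal sketch in C using strchr() and strspn(), assuming the character sets described earlier in this post (alphanumerics plus period, hyphen and underscore before the @, alphanumerics plus period after it). Those sets are narrower than the RFCs allow, and the name address_chars_ok is made up for illustration.

```c
#include <stdbool.h>
#include <string.h>

#define ALNUM "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

/* Hypothetical helper: checks the characters on each side of the first '@'
 * against the sets assumed in this thread. It does not implement the RFCs. */
bool address_chars_ok(const char *addr)
{
    static const char local_ok[]  = ALNUM "._-";
    static const char domain_ok[] = ALNUM ".";
    const char *at = strchr(addr, '@');
    size_t local_len;

    if (at == NULL || at == addr || at[1] == '\0')
        return false;                    /* need text on both sides of '@' */
    local_len = (size_t)(at - addr);
    if (strspn(addr, local_ok) != local_len)
        return false;                    /* unwanted character before '@' */
    /* A second '@' fails here because '@' is not in domain_ok. */
    return strspn(at + 1, domain_ok) == strlen(at + 1);
}
```

Because the two sets are separate strings, widening one of them (for example to accept + in the local part) is a one-line change.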
  

--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C
On 19/03/2020 14:12, Martin Gregorie wrote:
....

IMBW But these do indeed /only/ work inside a [...] construct so  
shouldn't your examples read
[[:alnum:]]
and
[.[:alnum:]_-]

(see eg
https://perldoc.perl.org/perlrecharclass.html#Bracketed-Character-Classes
under 'POSIX character classes'. Also eg
https://stackoverflow.com/questions/1085083/regular-expressions-in-c-examples
)
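
A minimal sketch in C of the corrected bracket expression in use with regcomp()/regexec(), reusing the negated "unwanted character" pattern from the start of the thread. The function name has_unwanted_char is hypothetical, and the choice to report a match when the pattern fails to compile is an assumption made so that the check fails closed.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if s contains any character
 * outside the set . alnum @ _ - */
bool has_unwanted_char(const char *s)
{
    regex_t re;
    bool found;

    /* [:alnum:] is only recognised inside an enclosing bracket expression. */
    if (regcomp(&re, "[^.[:alnum:]@_-]", REG_EXTENDED | REG_NOSUB) != 0)
        return true;               /* fail closed if the pattern is rejected */
    found = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

Note that what [:alnum:] matches depends on the current locale, so a program that calls setlocale() may accept more characters than the ASCII letters and digits.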




--  
Mike Scott
Harlow, England

Re: Regexes and C
On Thu, 19 Mar 2020 17:03:17 +0000, Mike Scott wrote:

Fair point  - never thought of that!


--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C

Stop trusting those sources; they don't know what they're talking about.
Use RFC5321 and RFC5322 instead.

That's fundamentally the wrong approach. Instead, use an appropriate
quoting/escaping scheme. See
https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html
for many examples.

--  
https://www.greenend.org.uk/rjk/

Re: Regexes and C
On Thu, 19 Mar 2020 20:30:58 +0000, Richard Kettlewell wrote:

Interesting stuff, but it's all HTML and JS-related - nothing much there I
can use outside that environment.

I'm dealing with bog standard e-mails which can have been sent from
almost any hardware using almost any software and, at the immediate point
of interest, are being passed between processes written in Python, C
and bash. My immediate concern is to sanitise sender addresses being
passed through a bash script, which is the only piece of the puzzle
written by myself apart, of course, from the sanitiser.


--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C

It's doable, but hard. My email address for this post is real and works.
But I selected it because it looks wrong to bad email address sanitizers.
Many years ago I used "#@..." instead, but I found out it was breaking
UUCP because the addresses were being passed unescaped to sh which saw
it as a comment. This was late 1990s, long after UUCP had mostly gone
away. My goal with the address is to thwart spammers, not legit users,
so I switched to the current style. Globbing rules mean that it would
only be risky with a file that matches the domain in the working
directory, which is fortunately very unlikely.

again real and deliverable. It was live for a year or so. In general,
the local part of an email address has very few hard and fast rules.
You'll find more rules in what commercial email providers are willing to
let you send as than in what the software needs to support.

You are best off if you can avoid ever letting email addresses be
interpreted by the shell. Stick them in "files" (streams for pipes
count as files) and have mail programs parse them out of the stream.
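
Where a C program does have to hand an address to another program as an argument, one way to keep the shell out of the picture is to exec the program directly instead of going through system() or sh -c. A minimal sketch, assuming a POSIX system; the function name run_with_address is hypothetical, and whether the target program honours the "--" end-of-options marker is an assumption that has to be checked per program.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs program with the arguments "--" and addr,
 * without involving a shell, so addr is never parsed as shell syntax.
 * Returns the child's exit status, or -1 on failure. */
int run_with_address(const char *program, const char *addr)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* "--" asks option-parsing programs to treat addr as an operand
         * even if it starts with '-'. */
        char *const argv[] = { (char *)program, (char *)"--", (char *)addr, NULL };
        execvp(program, argv);
        _exit(127);                /* only reached if exec failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The address reaches the child as a single argv entry byte for byte, so characters such as #, ;, $ or backquotes have no special meaning on the way there. What the child then does with it is a separate question.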

It's probably mentioned else-thread, and I haven't seen it yet, but
RFC821 (or is it 822?) comments in header lines are basically
regexp-proof. The simple cases can be handled, but not the full
complexity. The full complexity is basically only used by people being
deliberately difficult, so you don't run into it often. The part you
can't handle with regexp, at least in a single pass: the balanced
parenthesis for nested comments rule.
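
The nested-comment rule (from RFC 822, carried over into RFC 5322) is the kind of thing a depth counter handles easily even though a single regex cannot. A minimal sketch in C, assuming only parentheses and backslash escapes matter; quoted strings, which may legitimately contain parentheses, are not handled, and the name comments_balanced is made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true if every '(' in s is closed by a
 * matching ')', allowing nesting and backslash-escaped characters.
 * Quoted strings are NOT treated specially in this sketch. */
bool comments_balanced(const char *s)
{
    int depth = 0;

    for (; *s != '\0'; s++) {
        if (*s == '\\') {
            if (s[1] == '\0')
                return false;      /* dangling backslash at end of string */
            s++;                   /* skip the escaped character */
        } else if (*s == '(') {
            depth++;               /* entering a (possibly nested) comment */
        } else if (*s == ')') {
            if (depth == 0)
                return false;      /* ')' with no open comment */
            depth--;
        }
    }
    return depth == 0;             /* false if a comment is left open */
}
```

The counter is exactly the state a regular expression lacks: it has to remember how many comments are currently open, and that number is unbounded.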

Elijah
------
years ago used to maintain the "+ addressing" FAQ

Re: Regexes and C
On Thu, 19 Mar 2020 21:26:22 +0000, Eli the Bearded wrote:

Quite, but for a long time now I've been using getmail to retrieve mail  
from my ISP. It pops it into a pipeline, defined within a bash shell,  
which runs mail through SpamAssassin before passing it to a C program I  
wrote that quarantines any mail that got marked as spam before passing  
the rest to Postfix sendmail, which hands it to my Postfix server for  
delivery to local mailboxes.  

I do it this way because it means that I have no open ports on my  
firewall.

Meanwhile, I've needed to change the getmail configuration, which  
involved talking to its author to clarify some points and during this I  
got told I was a very bad boy for daring to do something so dangerous
as feeding a bash-defined pipeline from getmail: the only thing that gets  
passed as a parameter is the From: address, so I thought I should at  
least make an attempt to make sure nothing bad can subvert this single  
parameter being passed. The pipeline has been running for years with no  
problems.

The only thing that's changed is that a new system I want to run will  
require getmail to be run with a higher privilege level than at present  
so it can deliver mail directly to the new system via a revised  
Destination configuration.  

The old pipeline will remain unchanged apart from (maybe) adding some rat
traps to passing the From address into it.

So, given that I now have two versions of the address filter running (one
using a version of my original regex and the other doing an inverted
match to reject any addresses with unwanted character values), do I really
need to use either of them?

  
Thanks: I'll look at those tomorrow.


--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C
Martin Gregorie wrote:

Reading your post is telling me, you did not want to read the RFC, you did
not want to commit to the RFC, you want to do it your way, you have little
to no understanding of how e-mailing works.

Why bother all of the people? I mean it is a public list. You can post
whatever you want, but ... anyway. Do whatever you want, however you want.

You know we have 2020 and not 1990 :) - you probably want to check imapsync,
which I would be using if I were you. But even this I can not understand.

So you have your mail server/domain - somewhere else, why would you download
it locally? What kind of ports do you need on the firewall? Why?

And programming regex in C for a custom self-imposed issue! This is simply too
much for me :/ Why on the PI list?




Re: Regexes and C
On Fri, 20 Mar 2020 09:11:39 +0100, Deloptes wrote:

Do you ever LOOK at anything you read? If so you would have realised that  
was posted late at night when I was tired.

In future try THINKING before getting critical.
  

--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C
Martin Gregorie wrote:

I've been reading this for some time now and I really do not understand this
hassle.

I think I even read somewhere a few years ago, when studying all types of
online registration and verification mechanisms, that a mail address was not
meant to be checked on the client side for validity. Just let SMTP do
this for you and handle the result. This was confirmed by experts in the
field AFAIR (18mil mail addresses and > 200 domains) when I was involved in
a DMARC rollout there.

In the case of internal/external addresses, let SMTP do this for you - use the
configuration to handle all those issues.

I would rather accept the mail and let it be delivered. If it cannot be
delivered, it will be returned anyway.
Use your local SMTP to handle local domains and add rewrite rules there for
the external domains. Anything else is dropped. It's so simple.

If you need something like those "create account" templates, it is usually
handled by a verification link. Well, there are those sites that give you a
temporary mail address, but they are usually filtered (by precedence the
least).

And also what do you mean by sanitize?!

And last but not least - there is nothing greater than perl for working with
regex.  

But as it was already mentioned things changed over the years - you can have
UTF/unicode and most of the examples are not working, however I guess
people took care of that already
(https://learn.perl.org/examples/email_valid.html )

And if you need it fast you can embed
https://stackoverflow.com/questions/1616217/using-perl-with-compiled-c-library



Re: Regexes and C
On Thu, 19 Mar 2020 21:04:05 +0000, Martin Gregorie wrote:

& why would you expect bogus messages to be using an invalid sender
address? (Quite frankly, given the difficulty in validating an email
address, actually generating an invalid one must be almost as difficult.)
Sanitise the data you are actually processing.
If it is the sender address that is being stored & processed elsewhere,
then use a registration method that requires confirmation before it is
accepted.

--  
There are few people more often in the wrong than those who cannot endure
to be thought so.

Re: Regexes and C
  Re: Re: Regexes and C
  By: Alister to Martin Gregorie on Thu Mar 19 2020 22:25:52


 > & why would you expect bogus messages to be using an invalid sender
 > address (quite frankly given the difficulty in validating an email
 > address actually generating an invalid one must be almost as
 > difficult) sanitise the data you are actually processing.

i suspect the OP is attempting to mitigate command injections similar to those  
spoken of in this article...

https://exploitbox.io/paper/Pwning-PHP-Mail-Function-For-Fun-And-RCE.html

or

https://tinyurl.com/m4g5664

if wordwrapping breaks the >70someodd character line width somewhere along the  
path...


)\/(ark

Re: Regexes and C
On Thu, 19 Mar 2020 22:35:50 +1300, mark lewis wrote:

Exactly so. It's not common, but it can also be used to inject a poison
pill into the recipient's system.

It's well known that the From: header is not used at all to transfer mail
from sender to receiver - returned bounces are sent to the Reply-To
address. The only defined use of From: is to be displayed by the
receiving mail reader (MUA). Any other use is entirely up to the
recipient and their system.

A common use for the From: header is in mail archives, which typically  
index emails by sender, recipient, subject and date, but the wise  
archivist knows that the From: header can be, and frequently is, a pack  
of lies.  

Take a careful look at the next piece of spam you receive that's
apparently from a friend. Many MUAs default to showing just the From text
rather than both text and internet mail address. If yours is one of
those, reconfigure it to show both. This gives you the ability to recognise
spam without opening it.

Then use your MUA to look at all the headers and you'll see that spammers
are often both lazy and stupid: they often change the sender text to
spoof the victim but From: and Reply-To: both contain their real
address - unless, that is, the message was sent from a compromised
system, in which case a common pattern is: From text is your friend's
name, From address is the spammer's address and Reply-To is the address
of the compromised system.


--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C

No, they aren't. Bounces are sent to the transport-level sender address
(often called the "return path").

--  
https://www.greenend.org.uk/rjk/

Re: Regexes and C
On 20/03/2020 10:19, Richard Kettlewell wrote:
'envelope from'

--  
Climate Change: Socialism wearing a lab coat.

Re: Regexes and C

Yes, that too. Delivery agents often add it as a Return-Path: header as
their last act.

--  
https://www.greenend.org.uk/rjk/

Re: Regexes and C
On 20/03/2020 12:26, Richard Kettlewell wrote:
No, they do not. Ever.




--  
"when things get difficult you just have to lie"



Re: Regexes and C

$ grep -c ^Return-path  ~/mail/saved/2019
1159

--  
https://www.greenend.org.uk/rjk/
