Regexes and C - Page 3

Re: Regexes and C

Wow, that's garbage.

Just go with: it has at least one @ and is not longer than [some number
which is smaller than what you expect to be the minimum buffer size, e.g.
250]
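
That check is small enough to sketch in C. The function name and the 250-byte limit below are placeholders chosen for illustration (the limit is just the "e.g." figure above), and this is a plausibility test only, not RFC validation:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical limit: pick something below your smallest buffer size. */
#define MAX_ADDR_LEN 250

/*
 * Return true if addr contains at least one '@' and is no longer than
 * MAX_ADDR_LEN bytes.  Nothing else about the address is checked.
 */
bool
plausible_address(const char *addr)
{
        assert(addr != NULL);

        if (strlen(addr) > MAX_ADDR_LEN)
                return false;
        return strchr(addr, '@') != NULL;
}
```

Anything that passes still has to be proven by actually sending mail to it; the point is only to reject obvious junk and protect the buffer.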


Re: Regexes and C

Hmm.  Not on my system:

: gaja; cat re.c
#include <sys/types.h>

#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

const char *RE = "^[a-zA-Z0-9][.a-zA-Z0-9_-]*@[a-zA-Z0-9][a-zA-Z0-9.]*[a-zA-Z0-9]*$";

int
main(int argc, char *argv[])
{
        regex_t re;

        /* A cflags value of 0 compiles RE as a POSIX basic regular expression. */
        int err = regcomp(&re, RE, 0);
        if (err != 0) {
                char errbuf[128];
                regerror(err, NULL, errbuf, sizeof(errbuf));
                fprintf(stderr, "regcomp failed: %s\n", errbuf);
                return EXIT_FAILURE;
        }
        /* Test each command-line argument; regexec() returns 0 on a match. */
        for (int i = 1; i < argc; i++)
                if (regexec(&re, argv[i], 0, NULL, 0) == 0)
                        printf("The string %s matches\n", argv[i]);

        regfree(&re);

        return EXIT_SUCCESS;
}
: gaja; make re
cc -O2 -pipe    -o re re.c
: gaja; ./re 'a snipped-for-privacy@d.e'
: gaja; ./re ' snipped-for-privacy@d.e'
The string snipped-for-privacy@d.e matches
: gaja;

Note that 'a snipped-for-privacy@d.e' did NOT match.


My system includes this in regex(3), when discussing newlines:

     REG_NEWLINE     Compile for newline-sensitive matching.  By default,
                     newline is a completely ordinary character with no
                     special meaning in either REs or strings.  With this
                     flag, `[^' bracket expressions and `.' never match
                     newline, a `^' anchor matches the null string after any
                     newline in the string in addition to its normal function,
                     and the `$' anchor matches the null string before any
                     newline in the string in addition to its normal function.

That is, newlines are ordinarily treated like any other character.
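
To see the difference the flag makes, here is a minimal sketch. The helper name matches_with_flags() is made up for this example; only regcomp(), regexec(), regfree(), REG_NEWLINE and REG_NOSUB come from regex(3):

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Compile pattern with the given cflags and report whether it matches
 * anywhere in subject.  Returns false if the pattern does not compile.
 */
bool
matches_with_flags(const char *pattern, const char *subject, int cflags)
{
        regex_t re;

        assert(pattern != NULL && subject != NULL);
        /* REG_NOSUB: only a yes/no answer is needed, not match offsets. */
        if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
                return false;
        bool matched = regexec(&re, subject, 0, NULL, 0) == 0;
        regfree(&re);
        return matched;
}
```

With cflags of 0, "^b$" should not match the string "a\nb" while "a.b" does, because the newline is just another character. With REG_NEWLINE the two results swap: the anchors now recognise the line boundary and `.` refuses to match the newline.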


Are you sure you're matching against the string you think you are?
In particular, are you sure the string your program is matching
against actually contains a space?


You don't have to program in C++ to use RE2.  Just be able to link
against a library that is written in C++.


I think the latest version of "Advanced Programming in the UNIX
Environment" is quite good.  It has been kept up to date since the
unfortunately premature death of W. Richard Stevens.  I don't recall
whether it covers regular expressions, though.

It's been many years since I have used a book for that kind of thing,
so I'm afraid my recommendations for specific texts are dated.  :-(

    - Dan C.


Re: Regexes and C
On Thu, 19 Mar 2020 19:50:54 +0000, Dan Cross wrote:

Thanks for your example code - it clarified a couple of parameter
definitions that were rather unclear in the manpage and that I therefore
didn't understand. regerror() now works and so do the ^ and $ endpoint
anchors.

My code was somewhat larger than an SSCCE due to other stuff I wanted it  
to do, which is why you didn't see it.

SSCCEs don't seem to be much known outside Java circles, so here's a
reference: it's actually a very good idea that can be used with almost any
computer language:

http://sscce.org/

Sounds useful. Note made to look at it.
  
Call me old-fashioned, but I find a good book a lot more usable for  
reference than any reasonable sized screen or e-book I've seen so far.  
I'm likely to hold that view until the standard E-book becomes rather  
more booklike, i.e. with two double-sided pages used for, from front to  
back:  

Contents, two facing pages for the main content, and an index page.  

This would let a reader find stuff fast because content and index pages  
wouldn't have to be scrolled to all the time. If the whole thing is thin  
and uses e-ink displays, so much the better.


--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C
On Thu, 19 Mar 2020 13:18:58 -0000 (UTC)


    The Perl book on regular expressions includes one that validates
email addresses - it is extremely long (more than a page IIRC). They are
astonishingly difficult things to validate.

    There's this site https://emailregex.com/ which has what purports
to be a good regex for the job:

(?:[a-z0-9!#$%&'*+/=?^_`~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`~-]+)*|"(?:
[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\
[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+
[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]
[0-9]?)\.)(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:
[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\
[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

    You might need pcre (the perl compatible regex library).

--  
Steve O'Hara-Smith                          |   Directable Mirror Arrays
C:\>WIN                                     | A better way to focus the sun
Re: Regexes and C
Ahem A Rivet's Shot wrote:


FSVO "good"

Re: Regexes and C
On March 19, 2020 09:18, Martin Gregorie wrote:


1) To anchor a regex, use the '^' and '$' metacharacters. '^' matches the  
empty string at the start of a line, and '$' matches the empty string at the  
end of the line.

2) There is no regex that can validate email addresses with 100% certainty.  
You /can/ write a regex that will come close, but there will be valid  
outliers that your regex will call invalid.

3) The RFCs describe exactly what an email address can consist of. You want  
to study, at least, RFC 5322 section 3.4  
(https://tools.ietf.org/html/rfc5322#section-3.4 )

HTH
--  
Lew Pitcher
"In Skills, We Trust"


Re: Regexes and C
On Thu, 19 Mar 2020 13:18:58 -0000 (UTC), Martin Gregorie


https://www.regular-expressions.info/email.html


--  
    Wulfraed                 Dennis Lee Bieber         AF6VN
     snipped-for-privacy@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/

Re: Regexes and C
On 19/03/2020 13:18, Martin Gregorie wrote:
No. That's why I don't bother with regex, ever.

It's far faster for me to write a series of tests in C than to try to
work out what random gobbledygook will do the job in regex.
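
For comparison, a series of tests of that kind might look like the sketch below. It accepts roughly the same shape as the regex posted earlier in the thread (letters and digits, with '.', '_' and '-' allowed after the first character of the local part and '.' in the domain); the function names are invented here and it is no closer to RFC-complete than that regex was:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASCII-only letter-or-digit test; avoids locale-dependent isalnum(). */
static bool
is_alnum_ascii(char c)
{
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
            (c >= '0' && c <= '9');
}

bool
looks_like_address(const char *s)
{
        assert(s != NULL);

        /* Local part: a letter or digit, then letters, digits, '.', '_', '-'. */
        if (!is_alnum_ascii(*s))
                return false;
        s++;
        while (is_alnum_ascii(*s) || *s == '.' || *s == '_' || *s == '-')
                s++;

        /* Exactly one '@' separates the two parts. */
        if (*s != '@')
                return false;
        s++;

        /* Domain: a letter or digit, then letters, digits and '.'. */
        if (!is_alnum_ascii(*s))
                return false;
        s++;
        while (is_alnum_ascii(*s) || *s == '.')
                s++;

        /* Anything left over (a space, a second '@') means rejection. */
        return *s == '\0';
}
```

Whether this is easier to read than the one-line pattern is a matter of taste, but each rule is at least visible and can be stepped through in a debugger.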

Regex is for nerds to impress other people with. It's not a smart way to
program.

Same as SQL. By the time you have taken a day to write the SQL query  
that does everything you want, only to realise it takes 50 minutes to  
complete, you could have written most of  it in C and got it down to 3  
seconds...

--  
"Those who can make you believe absurdities, can make you commit
atrocities."



Re: Regexes and C
On Thu, 19 Mar 2020 17:48:31 +0000, The Natural Philosopher wrote:

I won't argue with that.
  
Here I DO disagree.  

SQL is fine unless you insist on writing huge, do-everything queries. I  
was involved with one of them years ago as part of a benchmarking  
exercise and that pretty much put me off writing that sort of thing for  
good. Never tried SQL procedures either, but concise SQL queries used  
judiciously within logic written in C or Java work very well and are easy  
enough to write and maintain.

Most SQL performance problems, IME anyway, boil down to crap database
design, meaning bad or nonexistent normalisation and incorrectly placed
or missing indexes. But, given that a relational database has a decent,
user-friendly query analyser and there's enough realistic test data, it's
generally quite simple to get the speed up to where it should be.

Of course, if the DBA can't normalise and doesn't understand an ERD,  
and if the system designers only provide small amounts of largely  
imagined data rather than a few hundred or thousand actual business data  
items, then OF COURSE the database performance will be crap.  

Don't ask me how I know that: I've been called on too many times to fix  
that sort of mess. But sometimes the clients got it right. In one project  
it was very nice indeed to be given half a million records of valid test  
data. That was for the last major DB I worked on: I did much of the  
design and then tuned it using that huge pile of actual data. It really  
sang from the off.


--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C
On Thu, 19 Mar 2020 20:13:33 -0000 (UTC)


    One more in a mature system - feature creep. The database was fine
for the original spec but nobody optimised for the new queries and tables
that got added for new features and worked fine testing the new features
against live data - it's just a pity what it did to the performance
under load once they got heavily used.

--  
Steve O'Hara-Smith                          |   Directable Mirror Arrays
C:\>WIN                                     | A better way to focus the sun
Re: Regexes and C
On Thu, 19 Mar 2020 21:20:38 +0000, Ahem A Rivet's Shot wrote:


Yep, that would do it too, but it's really just another case of same old,
same old, as either the original design culprits are still there, or
they've moved on, leaving a largely undocumented system for the new guys
to sort out. Or, of course, worse still, maintenance and enhancements
have been outsourced.

My point being that it's often the same mistakes being warmed over again.

I'm lucky in that, back when IDMS was a thing, I got some really good
training on its care and feeding. I largely picked up relational
databases on the job, but a surprising amount of what I learnt about IDMS
design was also relevant for RDBMS, especially the preliminaries: data
normalisation and using Entity-Relationship diagrams to design the
schema.


--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C
On Fri, 20 Mar 2020 01:29:06 -0000 (UTC)


    All too true.


    Hmm similar, I cut my DBA teeth on a thing called MDBS-III, a
network database engine that ran on CP/M, MP/M and MS-DOS[1]. Getting the
schema right really mattered because changing it was a *pig*.

[1] It wasn't until the AT that the MS-DOS version was the fastest.

--  
Steve O'Hara-Smith                          |   Directable Mirror Arrays
C:\>WIN                                     | A better way to focus the sun
Re: Regexes and C
On 19/03/2020 20:13, Martin Gregorie wrote:

Exactly what I said. Don't use SQL to do complex stuff - it's very hard to
get the syntax right and it runs like a dog with three legs amputated.


No. In the case where I did the biggest job- it was normalising a flat  
database of a few million UK postcodes into a relational one - none of  
these were the problem.

The problem was MySQL's inability to create good optimised
machine code out of SQL statements. Unlike - say - modern C compilers,
which astound me in their ability to write better assembler than I could
myself, MySQL is like going back to the first 8 bit C compilers I used.

On simple queries, yes, but not on complex ones involving conditional
selections of selections etc.

When I had finished, what I wanted ran well, with over a million records,
but it did not use complex queries.

Creating it from the data I started with would have, if I hadn't given  
up trying to do the whole job with SQL and restricted myself to simple  
queries, building enormous linked lists in C - over a gigabyte in size -  
and thinking hard about how I would access the contents.

--  
"People believe certain stories because everyone important tells them,
and people tell those stories because everyone important believes them.  
Re: Regexes and C
On Fri, 20 Mar 2020 06:38:41 +0000, The Natural Philosopher wrote:

... and shouldn't have been one.
  
I'm not surprised. MySQL was known as being a limited system which lacked  
any form of query optimisation. It and MS Access were both known to be  
very limited, especially when the data volume gets large.

The original big three were Informix, Ingres and Oracle, with IBM joining  
in later, initially having led the field with System/R, developed by Ted  
Codd and Chris Date. Incidentally, both have written extremely good books
about the care and feeding of RDBMS systems.  

Oracle has always been expensive and seems to need a lot of routine  
attention, or so I found when I briefly looked after a site.

I know very little about Informix, never having used it.

Ingres was always pretty good. Quick, easy to manage and with a decent  
query optimiser. There was a special University license which was cloned  
and became PostgreSQL, which is excellent, free and is currently  
maintained and developed. It has a good query optimiser and can be  
ignored for weeks or months on end - it just quietly gets on with  
automated housekeeping, etc.  

Ingres also sold a developers license for version 10 to Microsoft - this  
is where Microsoft SQL Server came from.

Try PostgreSQL next time. You'll be pleasantly surprised.
  
Indeed. It was, after all, only MySQL.
  
I've done much the same in Java rather than using the Derby RDBMS, but  
that was only because I wanted a small and fairly simple in-memory  
database behind the covers of a club rostering system I wrote for my  
gliding club. The translation from RDBMS terms to Java looks like this:

Row  -> Class with getters, setters and some table-level methods in it

Table -> ArrayList<Class>  

Index -> TreeMap

and, before you ask, yes I did normalise the data first and then draw an  
ERD before cutting any code. It also implements a number of rules about  
minimum gaps between duties, not rostering members of a glider syndicate
on the same day, etc, etc. Performance is good, with no delays noticeable  
during normal duty allocation/deallocation/moves or in switching between  
rosters.  
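
Since the thread is nominally about C, the same Row/Table/Index mapping can be sketched there too. This is not the Java code described above: the struct, its fields and the function names are invented for illustration, a plain array stands in for the ArrayList, and a sorted array of pointers searched with bsearch() stands in for the TreeMap.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Row: one record.  The fields are hypothetical. */
struct member {
        int id;
        const char *name;
};

/* Compare two index entries (pointers to rows) by member name. */
static int
cmp_by_name(const void *a, const void *b)
{
        const struct member *const *ma = a;
        const struct member *const *mb = b;

        return strcmp((*ma)->name, (*mb)->name);
}

/* Index: fill idx[0..n-1] with pointers into table, sorted by name. */
void
build_name_index(const struct member *table, size_t n,
    const struct member **idx)
{
        for (size_t i = 0; i < n; i++)
                idx[i] = &table[i];
        qsort(idx, n, sizeof idx[0], cmp_by_name);
}

/* Look a row up through the index; returns NULL if there is no such name. */
const struct member *
find_by_name(const struct member **idx, size_t n, const char *name)
{
        assert(name != NULL);

        struct member key = { 0, name };
        const struct member *keyp = &key;
        const struct member **hit =
            bsearch(&keyp, idx, n, sizeof idx[0], cmp_by_name);

        return hit != NULL ? *hit : NULL;
}
```

The table itself is just an array of struct member. Unlike a TreeMap, this index has to be rebuilt (or re-sorted) after inserts, which is fine for data that changes rarely.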


--  
Martin    | martin at
Gregorie  | gregorie dot org


Re: Regexes and C
On 20/03/2020 10:42, Martin Gregorie wrote:

And what a bloody good job MS' code minions have done. Runs like the  
clappers, stays running and the best bit of all, runs on Linux. So you  
can now run the engine on any old small cloud instance and use all the  
sexy tools (SSMS et al) on Windows. No need for (relatively) expensive  
Windows hosting any more.

Like everything DB, get the schema right, get the queries right, get the  
indexes right and it's a really, really solid system.

Mine runs fine in a 2GB mem, 2 core Xeon, 15GB SSD cheap VPS with 5500  
active users (not simultaneously!) with access controlled by a load  
balanced set of four RESTful API servers, three have API in C#/.NET Core  
on Linux and one runs Node.js + FreeTDS on Linux. A beautiful marriage  
of cross platform, cross-ideology that just works.



Re: Regexes and C
On Fri, 20 Mar 2020 10:42:51 -0000 (UTC), Martin Gregorie



    M$ Access is a GUI front-end (query and report writer) to the JET
database engine. JET being a "file-server" database (every application was
directly opening/managing the database file itself). Access later grew an
extension allowing one to use it as a GUI front-end for SQL Server/MSDE
databases (at the time, this was the difference between an MDB and ADP
[Access Data Project]).



--  
    Wulfraed                 Dennis Lee Bieber         AF6VN
     snipped-for-privacy@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/

Re: Regexes and C
On Thu, 19 Mar 2020 17:48:31 +0000


    In this case regex is not the problem, the problem is that email
addresses are not designed to be parsed.


    They're a tool like any other, useful when they help not so much
when they get in the way.


    They have their uses - dismissing a powerful tool is not a smart
way to program either.


    Complex SQL queries are often a mistake - but at least SQL is
reasonably consistent for the simple stuff.

--  
Steve O'Hara-Smith                          |   Directable Mirror Arrays
C:\>WIN                                     | A better way to focus the sun
Re: Regexes and C

Yes, they are designed to be parsed and parsers for them exist (for
instance in most email software). The specifications have always
contained grammars for them. The language specified in RFC822 isn't a
regular language, but that just means you need something a little more
sophisticated than a regular expression to parse it.
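
One concrete reason is that the parenthesised comments allowed by the grammar may nest to any depth, and keeping track of the depth takes a counter, which a true regular expression does not have. A minimal sketch of that one piece follows (the function name is mine, and a real parser has much more to handle than comments):

```c
#include <assert.h>
#include <stddef.h>

/*
 * If s starts with '(', return a pointer just past the matching ')',
 * honouring nested comments and backslash-escaped characters.
 * Returns NULL if s does not start a comment or the comment never closes.
 */
const char *
skip_comment(const char *s)
{
        int depth = 0;

        assert(s != NULL);
        if (*s != '(')
                return NULL;
        for (; *s != '\0'; s++) {
                if (*s == '\\') {
                        if (s[1] == '\0')
                                return NULL;    /* dangling backslash */
                        s++;                    /* skip the escaped character */
                } else if (*s == '(') {
                        depth++;
                } else if (*s == ')') {
                        if (--depth == 0)
                                return s + 1;
                }
        }
        return NULL;                            /* ran out of input */
}
```

A dozen lines with a counter do what no fixed regex can, which is why mail software uses a small hand-written or generated parser here.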

--  
https://www.greenend.org.uk/rjk/

Re: Regexes and C
On 19/03/2020 17:48, The Natural Philosopher wrote:

Tl;dr - I don't know how to do it properly, therefore it is crap.

---druck




Re: Regexes and C
On 20/03/2020 13:39, druck wrote:

No. The time to learn how to do it properly exceeds the time to do it  
the way I know so vastly that my life will be over before I *need* to  
learn it.



--  
"I guess a rattlesnake ain't risponsible fer bein' a rattlesnake, but ah  
puts mah heel on um jess the same if'n I catches him around mah chillun".

