Integer maths is optimised with "-O1" or higher. It does not need separate flags. (There are other flags that can affect the way integer overflow is handled, but the default is to follow the C standards, and that is usually what gives the fastest results anyway.)
A 10000 number lookup table will give you the result in a single operation. Do you have 20K of flash (or ram) space to spare? I don't know your particular chip, but many Cortex devices actually have quite a lot of flash space, and while 20K for a square root table might seem excessive, it could work. And it would be the fastest possible method.
You can also do smart things with a smaller table - say, a 157 entry table for the square roots of (x >> 6), shifted left by 3 bits. Your algorithm is then to take x, shift right 6 bits, look it up in the table as y. The integer square root of x is then between y and y + 7. A simple algorithm to check these would take 8 checks, each of which is simpler than binary search rounds. (And the table takes less flash.) A smarter algorithm could probably jump to the correct y faster, based on (x >> 6) and (x & 63).
That would still be slower than using VSQRT, but is a possibility if you want to avoid floating point.