Watermarking

Hi,

Anyone with FIRST HAND experience deploying watermarking technologies? Comments re: static vs dynamic techniques?

Also, any FIRST HAND (not hearsay) experiences with its use in litigation -- that you can *discuss* openly? Or, pointers to cases of interest? I only know (first hand) of two and technology has changed considerably in the years since...

Any pointers to (market) perceptions of watermarking?

Comp.realtime is included for any insights regarding how RT characteristics could be exploited to provide other watermarking opportunities (not typically available to non-RT applications).

Finally, any pointers to techniques to circumvent these? And, the effort required?

Thx,

--don

Reply to
D Yuniskis

Sorry, I have no experience in this; I've only read several articles about the subject. However, you don't give any hint about your problem, so:

- what do you want to watermark?
- how?
- why?
- what is the connection with real-time?

Because, from the little I know, there is not "one" watermark; there are as many as there are applications. And most can be easily circumvented as soon as the watermarking method is understood by the attacker.

good luck,

yg

--
http://ygdes.com / http://yasep.org
Reply to
whygee

The application.

That was the point of the post :>

To track "leaks" from the preproduction release.

Because you can exploit temporal characteristics of the application to implement the watermark. Since non-RT applications *don't* have well-defined temporal behavior, trying to exploit that capability isn't robust in those arenas.

There are many different approaches to watermarking. It's essentially a steganographic problem -- except the amount of data is small.

*Trivial* approaches can be circumvented. E.g., rearranging the string space, etc. But, more dynamic techniques require a greater effort to reverse engineer the code to understand how the watermark is even implemented (since you never *know* -- without "insider information" -- what characteristics are being used to implement it).

E.g., you can readily see how the string spaces of two (uniquely watermarked) versions of the same executable differ. But, if the two applications look "entirely different" in their binary images yet "execute the same" (functionally), the effort to reverse engineer can quickly become comparable to the effort required to design from scratch. That, after all, is the easiest deterrent.

Reply to
D Yuniskis

This sounds like something that might be amenable to a custom linker or linker script. You could rearrange the application and library functions in the code space. That would require a reverse-engineering effort to rebuild the function call tree each time.

Mark Borgerson

Reply to
Mark Borgerson

Yes, that's one of the "static" approaches you can use.

E.g., the simplest approach is:

char *watermark = "This is copy #027 of the image";

Of course, take *two* images and do a bytewise compare and it doesn't take long to realize:

- the images differ
- *what* the difference is

And, to deduce:

- how to remove any identifying information from *both* images!

Of course, your code could examine the watermark at run time and at least verify that it *looks* like a watermark. E.g., right number of characters before the terminating NUL, right context (perhaps all watermarks are of the form "This is copy #XXX of the image"), appropriate value for the "unique identifier" (e.g., "XXX" must be between "012" and "103").

You can build runtime checks that get increasingly more complex (e.g., storing hashes instead of simple strings).
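A minimal sketch (mine, not from the post) of the kind of run-time plausibility check described above. The string format, its length, and the 012..103 range come from the example; the function name is hypothetical:

```c
#include <string.h>

/* Return 1 if the watermark "looks right": correct length, correct
 * surrounding text, and a copy number within the issued range 012..103.
 * (This is exactly the sort of check a thief can find and defeat.) */
int watermark_plausible(const char *wm)
{
    int copy;

    if (strlen(wm) != 30)                        /* right number of chars */
        return 0;
    if (strncmp(wm, "This is copy #", 14) != 0)  /* right prefix          */
        return 0;
    if (strcmp(wm + 17, " of the image") != 0)   /* right suffix          */
        return 0;
    copy = (wm[14]-'0')*100 + (wm[15]-'0')*10 + (wm[16]-'0');
    return copy >= 12 && copy <= 103;            /* issued range          */
}
```

Of course, as noted below, a breakpoint on the watermark region leads straight to this routine.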

But, all of those are trivial to recognize and defeat. E.g., set a breakpoint to detect any references to the *watermark::*watermark+length(watermark)-1 region and then decompile the code that is playing with this data. Remove the code and/or come up with a *different* watermark that satisfies the criteria that the code imposes.

Not only is this an unreliable watermarking technique, but, it also allows the "thief" to shift blame onto a potentially innocent third party: "This is copy #028 of the image." If it is that easy to change an image's identity, then it is worthless (as it gives an accused party an easy defense: "Look how easy it is to *forge* some other watermark!")

You can rearrange the order in which modules are linked within the executable image. E.g., A, B, C, D, E vs. B, C, A, E, D. Given how many modules there are in *most* products (especially anything "significant" enough to be nontrivially decompiled IN TOTO), you can easily get hundreds of bits of information from this approach.

If you have far fewer *legitimate* images than you have "linkage possibilities", you can distribute those valid linkage possibilities to maximize their Hamming distance. This gives you a better counterattack against the "look how easy it is to forge some other watermark" defense. E.g., like credit cards.

Diddling with the linkage editor makes two unique images *very* different. "Nothing lines up" in a byte-wise compare (or, anything that does, does so purely by coincidence).

This also increases the effort required to "decipher" and "undo" the watermark itself. There is nothing *explicit* AND ARTIFICIAL in the code that verifies the validity of the "module placement" (you *could* add this but that would just draw attention to it as it would be very superfluous in its functionality -- like checking the "This is copy #027..." -- it adds nothing to the overall functionality of the product).

A counterfeiter would have to trace the call tree in detail to identify each reference to a "module" (or its components thereof) in order to be *able* to shuffle them around (assuming the thief realizes that the images *are* watermarked and the nature of the watermarking).

A similar approach is to shuffle the locations of strings (or other constant data) within the "initialized data" segment of the executable. When dealing with *just* strings, the effect is very obvious -- it is easy for a human to recognize that the differences in the files are *just* strings. And, if observant, you can see that they have just been reordered. So, the thief is faced with a task similar to tracing the call tree -- though slightly different as you are now just tracing references to strings (or parts thereof) instead of the entry points of functions, etc.

Assuming sizeof(pointer) is always constant (no special "short" addressing modes), the executables in each of the above scenarios are the same size. Any such addressing exploits that the compiler *can* leverage could result in the sizes of two different images differing.

All of these approaches are static, in nature. They are put into the image at compile time and can be "viewed" by simply examining the image "at rest" (static).

Other techniques create more transient watermarks -- the code has to be *running* to figure out what the watermark is (i.e., to observe it). These are harder to reverse engineer as they require watching a live system as it evolves.

If you add temporal guarantees from RT environments, then you add one more "difficult to constrain" variable to the mix!

The trick is finding a technique that is robust enough ("This is copy #..." is NOT) yet *simple* enough to implement (you don't want to burden *your* development effort unduly -- you want to burden the *thief's*!)

Reply to
D Yuniskis

What you describe looks more like the techniques used by virus writers to escape detection by pattern matching. There is a lot of research in this domain, if the security magazines I read are right. Self-modifying code, hashes, RSA keys, equivalent code sequences (like a++, a -= -1, a += 1, ...), etc. are well known in the security world. They are also supplemented with virtual machine and debugger detection and avoidance, so when the code detects that it is being single-stepped, it branches to dumb code that does nothing, or executes differently.

However, these "protection layers" often have a big price in execution performance, maintenance and bloat...

yg

--
http://ygdes.com / http://yasep.org
Reply to
whygee

OK, then let me clarify:

You *can* leave a mark in your binary. One simple example: you select 60 (or more) locations in your code where, for example, you increment a variable or add a constant. In each of them, you define (#ifdef) whether the operation is done by adding, or else by subtracting the complementary value.

So your source code has a lot of:

#ifdef WATERMARKxyz
    a += 42;
#else
    a -= -42;
#endif

You can then make a script that goes through all the WATERMARK000, WATERMARK001, WATERMARK002, ... defines and produces a binary combination based on a given serial number, like:

Define an arbitrary client number and store it in your database. For each client, apply a DES block cipher and/or a Hamming code to the client number, to provide more strength and resilience for the value that you will then encode.

Then you give this set of defines to the compiler, which will hopefully not optimize the subtractions back into additions. And you have your watermarked executable.
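A sketch of what such a generator might boil down to (the function name, the 60-site count, and the flag format are my assumptions; the DES/Hamming step is presumed to have already produced `serial`):

```c
#include <stdio.h>
#include <string.h>

#define N_SITES 60   /* number of WATERMARKnnn sites in the source */

/* Expand an (already ciphered/ECC-encoded) serial number into a set of
 * -DWATERMARKnnn compiler flags, one flag per set bit.  The resulting
 * string would be passed on the compiler's command line. */
char *watermark_flags(unsigned long long serial, char *buf, size_t len)
{
    int i;
    buf[0] = '\0';
    for (i = 0; i < N_SITES; i++) {
        if (serial & (1ULL << i)) {
            char flag[32];
            snprintf(flag, sizeof flag, "-DWATERMARK%03d ", i);
            strncat(buf, flag, len - strlen(buf) - 1);
        }
    }
    return buf;
}
```

E.g., serial 0x5 (bits 0 and 2 set) would emit `-DWATERMARK000 -DWATERMARK002`.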

The issue is that if someone happens to have 2 or more different copies of your code, he'll be able to XOR them and forge a new executable. Hence the Hamming code and DES: the more marks you put in, the better the chances that some original marks are not uncovered by the XORs. For example:

- with 2 XORed copies, 1/2 of the marks can be removed
- with 3 copies, only 1/4 of the marks remain
- etc.

Each newly found copy leaves fewer and fewer marks, so redundancy is paramount if you want to be sure to retrieve your data in case of conflict.
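The 1/2, 1/4, ... figures follow from a mark site staying hidden only when all captured copies happen to use the same variant: with two random variants per site, that probability is 2^-(k-1) for k copies. A one-liner (name is mine) to make the arithmetic explicit:

```c
/* Expected fraction of mark sites that survive a differential (XOR)
 * attack given k captured copies: a site stays hidden only when all k
 * copies picked the same of 2 variants, i.e. with probability 2^-(k-1). */
double surviving_fraction(int k_copies)
{
    double p = 1.0;
    int i;
    for (i = 1; i < k_copies; i++)
        p *= 0.5;   /* each additional copy halves the hidden sites */
    return p;
}
```

So with 60 sites and 4 captured copies, roughly 60/8 ≈ 7 sites remain hidden, which is why the redundancy margin matters.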

Furthermore, if you have more choices than just 2 different ways to do the same operations, it's even better as it might confuse the differential attack even more.

contact me in a private message for my consultancy fees ;-P

yg

--
http://ygdes.com / http://yasep.org
Reply to
whygee

I'm not sure this could (by itself) be of any "help", there. I.e., the goal isn't really to disguise what you are doing. Rather, just to change the way it's "packaged". You would, for example, see *huge* chunks of identical code (assuming it was written as PIC) that was just "located in a different place" in the executable.

E.g., you would still find "This is copy #..." *somewhere* in the executable -- it just wouldn't be in the same place in each copy.

(My understanding re: virus writers want to *disguise* "This is copy #..." ?)

Again, I think that's a different beast. I think that is just intended to complicate reverse engineering. The (primary) goal, here, is just to make each instance of the executable slightly different -- yet functionally identical (i.e., in an RT application, it still has to execute in exactly the same time, etc. -- this can be an issue as locality of reference can invalidate cache lines with different frequencies from one "copy" to the next.)

The instances of which I have first-hand knowledge made that very apparent. :> But, they went beyond simple watermarking to also try to thwart copying. I'm not trying to prevent that -- just *track* where the copy came from. (If client wants more, they'd have had to *ask* for more! :> )

And, to make it difficult for a thief to alter the image in a way that would enable him to make it look like the original (counterfeited) image came from a *different* instance (E.g., "This copy was issued to John Doe")

Note that this doesn't have to add any bloat -- since the goal isn't to mask your actions, just "permute" them.

Reply to
D Yuniskis

Yes, but this clutters up your code -- i.e., adds to the complexity that *you* have to manage. It adds (noticeably) to the cost of developing and maintaining the code as you are adding "stuff" that isn't really contributing to the algorithm itself.

But your example can't be easily optimized away -- it isn't part of the algorithm itself -- just "excess baggage". (i.e., what role does "a" play in the normal execution of the algorithm? Why can it be "+ 42" in this image and "+ 0" in some other -- without affecting the algorithm itself?)

Yes -- *if* you use trivial static differences. This was the point I was making with my "This is copy #027..." example -- if you can get your hands on some other copy (e.g., #075), then you can compare the two and the *difference* (between just those two instances) becomes readily apparent.

OTOH, if you scramble the order of constants (e.g., initialized data area) in the different images, then one might be:

yyyyXXXThis_is_copy_#027_of_the_image.ZZZZ

while another is

XXXThis_is_copy_#029_of_the_image.ZZZZyyyy

I.e., a diff(1) will turn up lots of differences. And, it isn't immediately obvious how those differences correspond with each other. (assume XXX, yyyy and ZZZZ are binary values) E.g., is "XXXThis_is_copy" an entity? And, "_#02" another? That just *happen* to be consecutively located in memory? Or, is "XXX" one entity, "This_is_copy_#02" another?

Imagine XXX, yyyy, ZZZZ, etc. are each *modules* (code fragments) *in* the executable. I.e., each may be thousands of bytes. It's a lot harder to recognize where the boundaries of those objects lie when looking at any (small) number of images -- especially as they may *change* based on their location:

E.g., any "jumps" will have different target addresses so "JUMP HERE" will have different representations depending on where "HERE" happens to be in this copy of the image.

But, again, you are adding complexity to the code that doesn't serve the goal of the original algorithm. That's the "add bloat and maintenance hassle" approach. :> You are performing extra actions in your code that are there primarily to obfuscate or "serialize" (watermark) the image. So, you are making your job incrementally harder.

The goal (ideally) is to come up with simple transformations and techniques that can be recognizable as "distinct instances" of the same original algorithm *without* adding bloat.

E.g., if I scramble the order of auto variables in a function declaration, I can (in theory) create two different binaries that operate identically (in practice, the compiler can choose to reorder these for me :< ).

So:

void functionX(void)
{
    int a;
    int b;
    int c;

    a = b;
    b = 2;
    c = a;
    ...
}

*can* generate different code than:

void functionX(void)
{
    int b;
    int a;
    int c;

    a = b;
    b = 2;
    c = a;
    ...
}

(There are risks to this, but let's ignore those.)

This transformation is easy to do -- you can preprocess the source file before compilation to "rearrange" the declarations -- and it doesn't detract from your efforts to implement the "original application". If you decompose (either explicitly while writing, or implicitly with a pre-processing tool) the module into discrete "functions" (execution units), then you can reduce the work that the compiler has to do by compiling each "version" of each function only once. I.e., the above example would result in two files -- functionX_order1.c, functionX_order2.c -- which could each be compiled *once*, and then *one* selected to be passed to the linkage editor to build a particular instance of the image.

Contrast this with having all (many) of the functions in one source file and having to preprocess all "5,000" variations (for each image instance!) of that source file -- 5,000 compiler invocations, etc.

I.e., these sorts of transforms give the end result desired (watermark) without interfering with the developer's actions -- yet make it considerably more difficult for a thief to permute any instance into any other instance WITHOUT A SERIOUS INVESTMENT OF TIME/EFFORT.

Reply to
D Yuniskis

hi !

In the end, we will all be eaten by the Chinese, then they will eat themselves. So why care ? ;-)

anyway, I see your points. Hope you'll succeed, yg

--
http://ygdes.com / http://yasep.org
Reply to
whygee

Does your hardware have external program memory, or does it run from internal flash? If the latter, distribute encrypted binaries.

Oliver

--
Oliver Betz, Munich
despammed.com might be broken, use Reply-To:
Reply to
Oliver Betz

Encrypting tries to *prevent* counterfeiting. That's not the goal (too often, there are ways to access internal resources by exploiting something in the device).

Stated goal is to track where "copies" came from. Presumably, prosecute the source of the leak -- who probably doesn't have very deep pockets (whereas counterfeiter *might* -- and might be overseas, etc.) and would be seriously concerned to see him/herself named in such a lawsuit.

(I suspect just the threat of this is the bigger deterrent. Dunno, I'm not the client)

Reply to
D Yuniskis

Hello Don,

as far as I understand, that's exactly the goal. You don't want someone to change the "watermark".

These exploits are likely more expensive than circumventing the other watermarking methods cited in this thread.

That's trivial if you distribute encrypted files. The ID ("watermark") can be in the encrypted file ((part of) the cipher's initialization vector) and in the application showing up somewhere in the user interface or the update routine (e.g. boot loader).

Oliver

--
Oliver Betz, Munich
despammed.com might be broken, use Reply-To:
Reply to
Oliver Betz

Let me rephrase: encryption prevents copying (which would be an added bonus!) The goal here is not to prevent but, rather, to *track* where a copy originated. Preventing copying is often harder to accomplish; and, is very obvious to the potential copier that you have taken measures to try to thwart that. OTOH, watermarking need not "announce" itself to the thief. He perceives a copyable product. He makes his copy and his copy *works*, etc.

For "small fish", you are correct. But, for big players, you would be surprised at how quickly and easily a product can be copied -- *if* you don't have to understand how it works! (I won't go into that here; google is your friend :> )

OK, so you are watermarking *then* encrypting. But, you still need a robust means of watermarking the executable (that can't be easily altered, thwarted, etc.) *That's* the issue I am trying to address.

Reply to
D Yuniskis

Hello Don,

It makes it harder to copy the whole device or disassemble the executables, but we make accessible (encrypted) firmware files freely to our customers. Therefore...

...that's also true for our devices (unless you mean by "copyable" that the customer is also allowed to copy the hardware).

[...]

Since the executable is not directly accessible to the customer, I simply could put a notice e.g. in the boot message. The customer only sees the encrypted file and the bootloader will not accept it if it's changed.

Oliver

--
Oliver Betz, Munich
despammed.com might be broken, use Reply-To:
Reply to
Oliver Betz

I've been following your conversation with whygee. No watermarking scheme is foolproof or unforgeable, and marking an executable is harder because it isn't possible to fuzz code like data.

I've read about "functional marking", changing the program's runtime behavior based on some key ... for example, strobing indicator lights in a pattern that's hard to discern by eye but could be clearly seen using a movie camera. But I don't see an easy way to do something that would remain non-obvious if the program were disassembled.

I agree with whygee that the best way to mark an executable is to abuse equivalent code sequences. However, I think it should be done *after* compilation, using a patching tool and starting from the unaltered binary. I also agree that the patches should be based on some (well distributed) customer id - perhaps a crypt hash of the customer information.

You want to substitute code sequences of equal length (unless PIC code makes it unnecessary). As a stupid example, it won't do any good to replace an addition with a subtraction if you then also need to set/clear a carry bit for the subtract instruction to work. You need to find a set of substitutions that are safe to make and which won't affect other code, but it helps that the substitution sequences can be specific to the binary being marked.

In many ISAs it does not affect the length of the code sequence to negate a comparison by reversing the operand order and then to branch on the negative result. Obviously you don't want to do this just anywhere because it potentially could mess up a multi-way branch, however, you can safely negate/reverse any 2-way branch. During coding, you could manually tag good candidates with a label and then pick the labels out of a map file or assembler listing of your release compile.
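At the C level, the branch trick looks like the following sketch (function names are mine; whether the two variants really emit equal-length, differing machine code must be checked on the target ISA):

```c
/* Two behaviorally identical variants of the same 2-way branch.
 * Negating the comparison and swapping the arms encodes one watermark
 * bit without changing what the function computes. */

/* "variant 0": branch on (a < b) */
int max_v0(int a, int b)
{
    if (a < b)
        return b;
    return a;
}

/* "variant 1": negated comparison, arms swapped */
int max_v1(int a, int b)
{
    if (b <= a)      /* logical negation of (a < b) */
        return a;
    return b;
}
```

A marking tool would pick variant 0 or 1 at each tagged branch site according to the customer-id bits.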

George

Reply to
George Neuner

In practice, post-patching is more difficult than source-code #defines.

My previous idea does not make any assumption about the target architecture, and C does not expose carry bits, so there is no risk of borking the binary. Sure, it is a bit more cumbersome for the source code, but I have done worse... A "macro" could help there:

#ifdef WATERMARKxyz
#define ADD_IMM(src, imm, dest) ((dest) = (src) + (imm))
#else
#define ADD_IMM(src, imm, dest) ((dest) = (src) - (-(imm)))
#endif

(Note you can't put #ifdef *inside* a #define, so the conditional has to select between macro definitions. An m4 script could be better.)

To help "calibrate" the detection routines, it sounds interesting, once the source code is ready, to compile one binary with all the #ifdefs set and another binary with none of them set. The XOR of the two results will show where all the ADDs are, with the added bonus that imm XOR -imm will show up as (nearly) plain 0xFF(FF(FFFF)) :-) plus, a bit before it, the opcode.

That would be an interesting hack to try...

Now, looking at code I wrote today, I see very few constant adds. But I see a fair amount of constants, which opens another door... For example, imagine a system call:

syscall(42)

42 can be decomposed in myriad ways: by addition and subtraction, or even XOR or AND. You need 2 operands and 1 operator. Store the 2 operands in volatile ints (so they are not optimised away); they provide one int of entropy (the other int is a linear combination based on the first int). Two more bits are provided by the combining operation: AND, XOR, ADD, SUB. Here again, macros and/or m4 will be useful. A sed script can even parse the C files and build a .h from the macro names it has found.

And it's still independent of any ISA.
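A sketch of the constant-decomposition idea (the values 0x1234 and the function name are arbitrary choices of mine; the volatile pair is what keeps the compiler from folding the expression back into the literal 42):

```c
/* Reconstruct the "magic" constant 42 from two watermark-bearing
 * operands and an operator.  The pair (p, q) and the choice of
 * operator (XOR here) carry the per-copy watermark bits; the result
 * is always 42 regardless of which decomposition was chosen. */
int watermarked_arg(void)
{
    volatile int p = 0x1234;        /* first operand: free entropy */
    volatile int q = 0x1234 ^ 42;   /* second operand: derived     */
    return p ^ q;                   /* always yields 42            */
}
```

The call site then becomes syscall(watermarked_arg()) -- or the expression is inlined via a macro -- instead of syscall(42).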

OK, I stop here, I feel too tired to think right, but you see the idea (I hope)

yg

--
http://ygdes.com / http://yasep.org
Reply to
whygee

Yes. In my case, I am acknowledging that copying *will* (probably) take place. And, just trying to track where the copies originated. (client's goal)

Think of company that blatantly *steals* your design. (this is far more commonplace than you would think!)

*Copying* hardware and software are easy if you've a mind to do so. OTOH, *understanding* a design in enough detail to be able to *change* it to something *equivalent* JUST to disguise the fact that you copied some other product requires considerably more effort. (First, you have to *know* this to be the case -- "Why do these N devices all have slightly different firmware images? Are they different versions? If so, which is the "most advanced"? etc.)

Presumably (though not a guaranteed fact), all of your images are identical (i.e., the *decrypted* version). I.e., if I picked up 2 of your devices and reverse engineered them, I would see two things identical "under the hood", right?

So, if I *copy* the device-as-ready-to-accept-an-encrypted-image, I can use any of your future released images. And, you can't tell *which* particular device I used as the template for my original "copy".

The goal here is to acknowledge that copying *is* possible (and affordable). But, to try to track where a "leak" may have occurred during the development/alpha/beta program. E.g., this puts a lot more pressure on those testers as you can now harass "the leak" *personally* when/if a copy turns up.

(I suspect trying to track down sources of copies after formal product release is not a concern. Rather, you would be quite annoyed to find that -- not only has your product been *copied* -- but the copy is commercially available *before* your "original" is!)

[I am making educated guesses, here, as to the actual reasons behind this design criterion :< ]
Reply to
D Yuniskis

Yes and no. To some extent, data manipulations are *easier* to observe and manipulate. But, code can also be "marked"; after all, it's just "data" interpreted by a state machine (known as the CPU).

No, I am not talking about anything that has to be obvious to a "special observer" -- other than an observer that can disassemble (meaning "decompose into small physical pieces") the device in question and compare it to a "template" suspected of being the original for the copy.

What you want is something that an observer with two (or more) instances (avoiding the term "copy") of an executable will recognize as "different" -- but, won't be able to easily figure out how to convert either of them into "yet another" instance that retains all of the original functionality.

But those are trivial to identify. I.e., they have to occupy the same space in all "incantations". They will tend to be confined to very small pieces of memory (huge code sequences get harder to manipulate while still satisfying any control transfers out/in). And, they will tend to be simple -- analysis will make it readily apparent that the changes are "meaningless": "Here he adds 5; there he subtracts -5."

But, you see, that is exactly the sort of thing that makes this approach trivial to circumvent.

Imagine, for example, compiling each instance with a different compiler and linkage editor. Or, with a different level of optimization, etc. (This won't work, either, because it can make *big* changes to the performance/requirements of the device.) I.e., in each case, it's the same "product" but the code images look very different.

I don't see this as a realistic way forward. It puts too much burden on the developer. And, doing it as a post process means the tools *developed* to do it would be complex -- would they introduce bugs, etc.

I think, for a given, known toolchain, you could get the results I need just by clever manipulations of the sources -- *without* the participation or consent of the developer (i.e., by fudging things that he technically "can't control" as inherent in the language specification).

I think I'll try wrapping some preprocessor directives around select code sequences and building some M4 macros to massage the sources as I would think they could be. Then, look at the results.

Reply to
D Yuniskis

Hello Don,

[...]

yes, I don't use any "watermarking".

yes (if they have the same firmware revision).

but I expect copying to be as expensive as reverse engineering any watermarking you are thinking about.

Have you any numbers about the cost to get the contents of a flash microcontroller if its "copy protection" is used? For example, we are using Freescale 9S08, S12, Coldfire V2 and I could also imagine using an STM32.

Oliver

--
Oliver Betz, Munich
despammed.com might be broken, use Reply-To:
Reply to
Oliver Betz
