Thinking back to my first job, nearly 50 years ago now, when I had to disassemble DEC's paper tape BASIC interpreter in order to enhance it, I guess that disassemblers and decompilers must now be ten-a-penny, especially for programs running under Windows, where the structure of programs is well known and C can be assumed as the source language. But I wonder whether Artificial Intelligence could, after being fed with numerous instruction sets, take a block of binary and work back to its source without any prior knowledge of the instruction set. I am particularly interested in the binary blob provided for Raspberry Pi computers, with a view to getting detailed knowledge of the video processors employed therein.

Now *that* would be an interesting AI project to see the results of. I'm pretty sure the answer to your question is "Nobody knows, please publish when you find out" or thereabouts. There's plenty of training material available in the form of open source compiled for all sorts of platforms; you just need to decide on an AI architecture that's up to the job (hopefully something short of AlphaGo Zero), build it (or rent it in "the cloud") and train it. It would still be useful even if you had to train one for each instruction set (or family). The biggest challenge would be comparing the output with the original source code, but code that compiles to an equivalent binary would be good enough as long as it didn't cheat (by creating a binary array and calling it, for example).

-- Steve O'Hara-Smith | Directable Mirror Arrays C:>WIN | A better way to focus the sun The computer obeys and wins. | licences available see You lose and Bill collects. ...

AI and decompilation?


Dan Espen 5 years ago

The programs were reading our files. We already had record layouts for those files.

Dan Espen


Dan Espen 5 years ago

Yep.

At one place I worked, they had a program whose source had been lost and then reconstructed from object code, and they were complaining that no one could work on it because of the variable and routine names.

Seemed easy enough to me and I fixed it up in a day or 2.

Dan Espen


Richard Kettlewell 5 years ago

Why would you do that instead of reading a reference manual for the target architecture?

https://www.greenend.org.uk/rjk/


Ahem A Rivet's Shot 5 years ago

The documentation for the GPU on the RPi has not been published; he seeks to reverse engineer it from the binary code that implements a published API on it.

Steve O'Hara-Smith | Directable Mirror Arrays C:\>WIN | A better way to focus the sun The computer obeys and wins. | licences available see You lose and Bill collects. | http://www.sohara.org/


The Natural Philosopher 5 years ago

Yes. I am certain that certain compilers and certain languages leave a fingerprint: always THAT register, used to do THAT job, always that particular sequence of assembly to mimic that high-level construct.

I cut my teeth on microprocessor assembly. Then C. Some things that are neat in assembler are ugly as sin in C. Take a call table. In assembler, you set up a range of memory whose contents are the addresses of subroutines. You load the accumulator with a number, left shift it once (doubling it, since each address occupies two bytes), add it to the content of a register set to point to the base of that memory block, and use that register to point to an address whose contents are the address you want to 'call'. Simple, efficient and, provided you ensure nothing out of bounds is in the accumulator, bomb proof.

Now try that in C: you need an array of pointers to functions, a simple check on the index you use, and then a statement to call the function whose address is in that array. I never ever managed to get an 8-bit compiler to actually do that. People just don't call the contents of an array of pointers to functions.
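For reference, a minimal sketch of the C form described above: an array of pointers to functions, an index check, and a call through the table. The handler names (`op_add`, `op_sub`, `op_mul`) and the `dispatch` wrapper are made up for illustration and do not come from any real code base.

```c
#include <stddef.h>

/* Hypothetical handlers, named only for this illustration. */
static int op_add(int a, int b) { return a + b; }
static int op_sub(int a, int b) { return a - b; }
static int op_mul(int a, int b) { return a * b; }

/* The call table: an array of pointers to functions. */
typedef int (*handler_fn)(int, int);
static const handler_fn call_table[] = { op_add, op_sub, op_mul };

#define TABLE_SIZE (sizeof call_table / sizeof call_table[0])

/* Check the index, then call through the table.
 * Returns 0 and stores the result in *out on success,
 * or -1 if the index is outside the table. */
int dispatch(size_t index, int a, int b, int *out)
{
    if (index >= TABLE_SIZE)
        return -1;
    *out = call_table[index](a, b);
    return 0;
}
```

Calling `dispatch(2, 6, 7, &r)` goes through the third table entry; an index of 3 or more is rejected before any call is made, which is the C equivalent of making sure nothing out of bounds is in the accumulator.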

It's easier by far to set up a switch statement, which takes care of out-of-bounds defaults, and ends up producing a chain of if ... else if ... else conditional calls to hardwired functions.

That's how you write it, because it's pretty much as fast on a pipelined processor, RAM is cheap and comprehensibility beats programming elegance hands down in the real world.

I've examined a lot of compiled machine code and it's pretty easy to tell what language it is, and roughly what it was written as. Stack-based variables are a bit of a giveaway, pointing to C or a similar language. Highly optimising compilers of course automatically obfuscate things, but that's the fun, isn't it?

I gave up writing assembler for *86 CPUs when the Gnu compiler was patently doing a better job than I would in assembler, and the ability to write something long winded and easy to understand and have the compiler completely rearrange it and turn it into three lines of incomprehensible assembler, was to be respected.

I think it is, up to a limited point, entirely possible to make an AI that could replace machine code with editable and compilable source code. But there will always be the Problem Of Induction: many, many possible constructs in source, using an unlimited choice of variable and function names, could compile to the same object code. And there is no way to reinstate the comments either, so it ultimately becomes an exercise in hand editing and rewriting the comments manually, almost as big a job as writing from scratch.

I suspect this is how Linux writers write freeware drivers for proprietary hardware: disassemble the manufacturer's drivers, and at least mimic the program flow, if not the actual source code.

"I know that most men, including those at ease with problems of the greatest complexity, can seldom accept even the simplest and most obvious truth if it be such as would oblige them to admit the falsity of conclusions which they have delighted in explaining to colleagues, which they have proudly taught to others, and which they have woven, thread by thread, into the fabric of their lives." - Leo Tolstoy


The Natural Philosopher 5 years ago

+1001

"First, find out who are the people you can not criticise. They are your oppressors." - George Orwell


Pancho 5 years ago

Yes, I understand how you can disassemble a simple program. I did it myself in the 1980s.

However modern programs are much more complex. They are built upon many levels of indirection, libraries, composition, inheritance, function pointers, events, etc, etc... We use structure, design patterns and such like to allow us to recognise complex ideas quickly. That gets lost in compilation.

I just can't see how I would reverse engineer an understanding of anything but the most simple disassembly in any reasonable time frame.


The Natural Philosopher 5 years ago

Well, you have hints from what the code does. Let's say you have code that loads data from two stack-based memory locations, adds them together and uses the result to access what is clearly an array. That gives a strong hint that the original variables were integers, and that the sum is simply a temporary index into that array, so you call it 'i' or 'arrayIndex' pro tem...

Then once you have an idea as to what data that array holds, you can update it and the index to something more meaningful.
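A rough sketch of those two passes in C. Every name here is invented for the example: `sub_0`, `var_a` and `var_b` stand in for the placeholder names a first pass would assign, and the "scanline" reading in the second pass is only an assumed interpretation of what the array might turn out to hold.

```c
/* First pass: names are placeholders chosen purely from how the values are used. */
int sub_0(const int *array, int var_a, int var_b)
{
    int arrayIndex = var_a + var_b;   /* two stack locals added together */
    return array[arrayIndex];         /* the sum is used to index an array */
}

/* Second pass: identical logic, renamed once the array's contents are understood.
 * The scanline/column meaning is an assumption made for this illustration. */
int pixel_at(const int *scanline, int origin, int offset)
{
    int column = origin + offset;
    return scanline[column];
}
```

Both functions behave identically and would normally compile to the same instructions; only the names a human reads have changed.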

The whole process is actually covered in philosophy: It is the problem of induction. How do you work back from results to causes?

Given that the answer to Life The Universe and Everything was '42', what in fact was the question? (40+2)? (6x7)?

There are an infinite number of expressions that give that answer, and an infinite number that don't.

This is where Karl Popper's philosophy of science steps in. Instead of holding that there is One True Reason why science works, namely that scientists are in the business of discovering the Truth, he pointed out that just because stuff worked (and 6x7 does indeed give 42), that was no reason to suppose that some other completely different construct might not work equally well, and that this had indeed happened with Newtonian gravity and relativity.

The Problem of Induction is that many theories can give the same predicted result. Sherlock Holmes is a sham. The Dog That Didn't Bark in the Night didn't bark, allegedly, because it knew the thief. Why? It might have been abducted by aliens, drugged, actually out hunting rabbits, in a soundproof box, or the Russians did it using a robot. Or it was just too plumb wore out with old age to care.

The truth is not provable. All we have is stuff that works. Given running machine code, there are an infinite number of source codes that might have produced it, and an infinite number that did not.
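A small C illustration of the point, with function names invented for the example. All three sources have the same observable behaviour. With optimisation enabled, a typical compiler will usually fold the first two into the same constant, and often the loop as well, although that is an assumption about the compiler and not something the language guarantees.

```c
/* Three different sources with the same observable result. */
int answer_a(void) { return 6 * 7; }

int answer_b(void) { return 40 + 2; }

int answer_c(void)
{
    int total = 0;
    for (int i = 0; i < 7; i++)   /* add 6 seven times */
        total += 6;
    return total;
}
```

Given only the resulting machine code, there is nothing to say which of these, if any, the original programmer wrote.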

We aren't there, ultimately, to reproduce *the* exact source, but to arrive at *an* editable source that we can use. Like science, and religion, it doesn't have to be true to be useful, and like science, and religion, its ultimate content will be forever truth-undecidable.

"First, find out who are the people you can not criticise. They are your oppressors." - George Orwell


Richard Kettlewell 5 years ago

I was under the impression it was a VideoCore IV, which appears to be sufficiently documented for a GNU toolchain port.

formatting link

https://www.greenend.org.uk/rjk/


Adrian Caspersz 5 years ago

If that became possible, it would not be a big step for an AI machine to analyse itself or another AI machine. It could make clones and unwittingly modify them.

Who knows where that could lead, or what mutations could happen? Life?

The Chinese would be very interested in you.

I'm sure some of the architecture is provided in layers, some public, like frame buffers, and some not, like acceleration features. So your machine code experiments could be done on the former, to learn to walk first. Or choose another, more open graphics chipset if you need more documentation to get to first base. Perhaps there is one on a low-end mobile phone?

Here's a manual way of reverse engineering random Chinese hardware.

[016] IT9919 Hacking - part 1 - Reading firmware with flashrom

formatting link

Your AI solution would have to replicate the ability of the human.

Adrian C


gareth evans 5 years ago

ISTR that my attack on the executable started by seeking out lines of code that might be subroutine calls, "JSR PC, address" in the PDP11 code. This served to create a number of identifiable and separate blocks from which to proceed.

Of course, this was much easier as it was a stand-alone paper tape program with no operating system underneath to muddy the water.
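A minimal sketch of that first step in C, assuming the image is available as an array of 16-bit words loaded at a known address. It looks only for the two usual encodings of "JSR PC, address": octal 004767 (PC-relative operand) and 004737 (absolute operand). The function name and parameters are hypothetical, and because the scan does not track instruction boundaries, a data word that happens to match will be reported as well, so the results are candidates to check by hand.

```c
#include <stddef.h>
#include <stdint.h>

#define JSR_PC_RELATIVE 004767u  /* JSR PC, addr   (PC-relative operand) */
#define JSR_PC_ABSOLUTE 004737u  /* JSR PC, @#addr (absolute operand)    */

/* Scan `count` words, loaded at byte address `base`, for words that look
 * like JSR PC calls. Stores up to `max` candidate target addresses in
 * `targets` and returns how many were stored. */
size_t find_jsr_targets(const uint16_t *words, size_t count, uint16_t base,
                        uint16_t *targets, size_t max)
{
    size_t found = 0;
    for (size_t i = 0; i + 1 < count && found < max; i++) {
        uint16_t operand = words[i + 1];
        uint16_t operand_addr = (uint16_t)(base + 2 * (i + 1));
        if (words[i] == JSR_PC_RELATIVE) {
            /* The PC has already moved past the operand word
             * when the offset is added to it. */
            targets[found++] = (uint16_t)(operand_addr + 2 + operand);
            i++;  /* skip the operand word */
        } else if (words[i] == JSR_PC_ABSOLUTE) {
            targets[found++] = operand;
            i++;  /* skip the operand word */
        }
    }
    return found;
}
```

Each target address found this way marks the likely start of a subroutine, which gives the separate blocks from which to proceed.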


Martin Gregorie 5 years ago

+1

-- Martin | martin at Gregorie | gregorie dot org


gareth evans 5 years ago

Indeed!

I've discussed this before (and probably too often, according to my biographers and stalkers!), but I'm interested in computers for themselves, as wonderful complex machines, and not interested in what you can use them for.

My frustration lies with the Raspberry Pi series, which come, for very little outlay of pennies, with a multiprocessor graphics chip that is believed to exceed the capabilities of the associated ARM processor but about which no detailed information is forthcoming.


gareth evans 5 years ago

Because no such manuals are available. The Broadcom GPUs are a closely guarded proprietary secret, kept from hoi polloi.


gareth evans 5 years ago

That's an interesting and thought-provoking aside!


gareth evans 5 years ago

The first of those does not produce anything.

Does the second describe the GPU in some detail and describe the instruction set such that I might produce my own binary blob to do something completely different?

Also, AIUI, a different GPU has been incorporated into the 64-bit RPis.

Anyway, thanks for your input.


J. Clarke 5 years ago

Because there are features not described in the reference manual.


Bob Eager 5 years ago

One of my former colleagues did a Ph.D. on it:

formatting link

Using UNIX since v6 (1975)... Use the BIG mirror service in the UK: http://www.mirrorservice.org


Thomas Koenig 5 years ago

The Natural Philosopher wrote:

One thing that is hard to do with C is to have different entries to the same function, something like:

bar:
        .cfi_startproc
        ... do something
foo:
        ... do something else
        ret

and then either call foo or bar.
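The closest that portable C usually gets is sketched below: `bar` does its extra work and then hands over to `foo`. The arithmetic is a placeholder for "do something" and "do something else". Whether the compiler turns the call into a plain jump, inlines it, or emits an ordinary call is up to the optimiser, so the fall-through layout of the assembler version is not guaranteed.

```c
/* Second entry point: only the shared tail ("do something else"). */
int foo(int x)
{
    return x * 2;
}

/* First entry point: the extra work ("do something"), then the shared tail. */
int bar(int x)
{
    x = x + 1;
    return foo(x);
}
```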


Thomas Koenig 5 years ago

Adrian Caspersz wrote:

The solution to the halting problem :-)
