AI and decompilation?

Thinking back to my first job, nearly 50 years ago now, when I had to disassemble DEC's paper tape BASIC interpreter in order to enhance it, I guess that disassemblers and decompilers must now be ten-a-penny, especially for programs running under Windows, where the structure of programs is well known and C can be assumed to be the source language?

But I wonder if Artificial Intelligence could, after being fed with numerous instruction sets, take a block of binary, and analyse its source without any prior knowledge of the instruction set?

I am particularly interested in the Binary Blob provided for Raspberry Pi computers, with a view to getting detailed knowledge of the video processors employed therein.

Reply to
gareth evans

Now *that* would be an interesting AI project to see the results of. I'm pretty sure the answer to your question is "Nobody knows, please publish when you find out" or thereabouts.

There's plenty of training material available in the form of open source compiled for all sorts of platforms; you just need to decide on an AI architecture that's up to the job (hopefully something short of AlphaGo Zero), build it (or rent it in "the cloud") and train it. It would still be useful even if you had to train one for each instruction set (or family).

The biggest challenge would be comparing the source codes, but code that compiles to an equivalent binary would be good enough as long as it didn't cheat (create a binary array and call it, for example).
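
That acceptance test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `compile_fn` is a hypothetical stand-in for whatever compiler and flags you choose, and `looks_like_blob` is a crude, made-up heuristic for the "binary array" cheat, not an established method.

```python
import re
from typing import Callable

# Matches a single hex or decimal integer literal
_NUMERIC = re.compile(r"0[xX][0-9a-fA-F]+|\d+")


def looks_like_blob(source: str, threshold: float = 0.5) -> bool:
    """Hypothetical cheat detector: True when more than `threshold`
    of the word-like tokens in the source are numeric literals."""
    tokens = re.findall(r"\w+", source)
    if not tokens:
        return False
    numeric = sum(1 for tok in tokens if _NUMERIC.fullmatch(tok))
    return numeric / len(tokens) > threshold


def round_trip_ok(source: str, target: bytes,
                  compile_fn: Callable[[str], bytes]) -> bool:
    """Accept candidate source only if it is not a disguised byte array
    and recompiling it reproduces the target binary exactly."""
    if looks_like_blob(source):
        return False
    return compile_fn(source) == target
```

Byte-for-byte equality is the strictest possible check; in practice you would probably want something looser, since different compiler versions rarely produce identical output.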

--
Steve O'Hara-Smith                          |   Directable Mirror Arrays 
C:\>WIN                                     | A better way to focus the sun 
The computer obeys and wins.                |    licences available see 
You lose and Bill collects.                 |    http://www.sohara.org/
Reply to
Ahem A Rivet's Shot

I think a lot of the problem is defining the question.

What do you want it to do?

Reply to
Pancho

On Mon, 4 Jan 2021 11:00:29 +0000, gareth evans declaimed the following:

Actually, I think the use of disassemblers et al has fallen away. Modern processors have so many peephole optimizations and out-of-order execution streams that converting an executable back to assembly source is almost meaningless -- and getting back to a high-level language is near impossible. One would have to be an expert at the assembly for a processor to have any chance of understanding the result.

--
	Wulfraed                 Dennis Lee Bieber         AF6VN 
	wlfraed@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/
Reply to
Dennis Lee Bieber

The retro-computing guys, those who are fans of the MC6800 and MC6809 microprocessors anyway, seem to be getting a rather good semi-interactive disassembler up and running. So far it understands executables that run under FLEX and FLEX09 (for the 6800 and 6809 respectively) and under UniFlex and OS9/level 1 and 2 on a 6809, and it can automatically detect which OS the binary was compiled for. This is quite impressive, since all four OSen have very different API call structures despite FLEX09, UniFlex and OS/9 all running on the same chip.

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

AF6VN DE G4SDW

But we Radio Hams thrive on such low level technicalities! :-)

73.
Reply to
gareth evans

I don't want it to do anything. I want to play at a low level with the thing ... large oaks from little acorns grow.

Reply to
gareth evans

Security experts have several very powerful disassemblers and decompilers they use for Intel/AMD/ARM processors.

formatting link

Reply to
Scott Lurndal

Well, in my last job I often used disassemblers. IBM z/OS. Very useful for understanding IBM code.

I can't see what out of order execution has to do with a disassembler. You disassemble executables.

Since I understand Assembler, I certainly got meaning out of it even if the original was an optimized HLL. You can see what services are being called.

--
Dan Espen
Reply to
Dan Espen

I suspect AI could be trained to do that, perhaps better than being trained to read English. Not sure if anyone has ever tried.

The info-sec people use disassemblers all the time, and don't limit themselves to binaries compiled from C and intended for Windows. They try to extract passwords and locate flaws in firmware for all sorts of internet-connected things. I recall Cybergibbons creating some tutorials in November or December. It was linked from his Twitter account, but I didn't pay that close attention to where it was. A quick look at his blog and YouTube didn't find them, but he's got a robust web presence.

Elijah

------ have you searched if anyone else has reverse engineered it already?

Reply to
Eli the Bearded

Play with what thing? What is an instruction set, what is the Binary Blob? Why do you need an AI?

Most compilers leave fingerprints on executables; you don't need an AI to detect them. I remember decompiling in the early '80s, but complex modern code can often be a challenge to naively reverse engineer into a high-level understanding, even if you do have the source code. Take away sensible variable and function names and you are stuffed.

Reply to
Pancho

Somehow I think that we're not singing from the same hymn sheet.

Sorry.

Reply to
gareth evans

I've had more than one experience in putting those meaningful variable names right back. It's actually pretty easy, a somewhat rote process. Find the read input instruction. Since you know the layout of the input record, you now have labels for many of the references to that input area.

I think you can work out how to proceed.
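
For anyone who wants the first step spelled out, here is a rough Python sketch. Everything in it is illustrative: the field names, offsets and the buffer address are invented, and a real tool would also have to track base registers rather than absolute addresses.

```python
from typing import List, NamedTuple, Optional


class Field(NamedTuple):
    name: str     # label taken from the known record layout
    offset: int   # byte offset of the field within the record
    length: int   # field length in bytes


def label_for(address: int, base: int, layout: List[Field]) -> Optional[str]:
    """Turn a reference into the input area into 'FIELD' or 'FIELD+n'.
    Returns None when the address lies outside the record."""
    rel = address - base
    for field in layout:
        if field.offset <= rel < field.offset + field.length:
            extra = rel - field.offset
            return field.name if extra == 0 else f"{field.name}+{extra}"
    return None


# Hypothetical input record, read into a buffer at address 0x2000
LAYOUT = [Field("CUSTNO", 0, 6), Field("NAME", 6, 20), Field("BALANCE", 26, 5)]
print(label_for(0x2006, 0x2000, LAYOUT))   # NAME
print(label_for(0x2008, 0x2000, LAYOUT))   # NAME+2
```

Once the input fields carry names, the instructions that move them elsewhere tell you what the work areas and output fields are, and so on outwards.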

--
Dan Espen
Reply to
Dan Espen

Without the source how do you know any meaningful variable names in the first place?

Reply to
Pancho

Apple essentially do this for their Rosetta 2 x86-to-ARM converter. They take existing x86 executables, which are likely generated by their Xcode LLVM compiler. They convert the assembly back into LLVM's intermediate representation, which is the idealised-assembly representation most of the compiler stages work on. Then they push that IR through the regular ARM LLVM backend, including optimiser stages, to produce 64-bit ARM executables.

It's not a language intended for humans to read, but it's high enough for the compiler stages to work on. Doing it this way avoids having to emulate any x86 instructions.

Theo

Reply to
Theo

There is an intermediate disassembler style that sits between a traditional disassembler and the mythical AI disassembler: that is the 'semi-interactive' type I mentioned. Since I know of at least one of these that is currently up and running I probably should have explained it better, so here goes:

What I meant by this is a disassembler that initially generates an assembly source file but doesn't just save it. Instead it shows that to the user in an interactive, scrolling display which allows the user to assign names to branch destinations, call targets and addresses of variables, while simultaneously storing these in a symbol table, which is also viewable, editable on screen and can be saved and later reloaded at the start of a future session.

Most importantly, at any point you can rerun the disassembly, and this time the disassembler will use the names in the symbol table in its output. IOW, after you've added one or more name/address pairs to the symbol table, rerunning the disassembler will incorporate these into the new version of the disassembled source. Working this way is obviously faster and less error-prone than saving the first pass disassembler output and manually editing it.

For extra points the disassembler should be able to:

- start by reading a predefined symbol set that contains the OS API names and names of OS public variables.

- be configurable to search for and read in more than one symbol set.

- use a modified version of the symbol table editor to add comments that will appear as comment blocks in front of a nominated address or after the address content as a trailing comment.

- generate a disassembled source file that can be assembled without needing further changes.
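
The rerun-with-symbols loop described above can be sketched in Python. This is a toy under stated assumptions, not the design of the tool mentioned: instructions are assumed to be already decoded into (address, mnemonic, target) tuples, and the JSON format for the saved symbol table is invented for the example.

```python
import json
from typing import Dict, List, Optional, Tuple

# (address, mnemonic, target address or None); decoding is assumed done elsewhere
Instruction = Tuple[int, str, Optional[int]]


def render(instructions: List[Instruction], symbols: Dict[int, str]) -> List[str]:
    """Produce one listing line per instruction, substituting names from
    the symbol table for both labels and operands where they are known."""
    lines = []
    for addr, mnemonic, target in instructions:
        label = symbols.get(addr, "")
        if target is None:
            operand = ""
        else:
            operand = symbols.get(target, f"${target:04X}")
        lines.append(f"{label:<8} {mnemonic:<5} {operand}".rstrip())
    return lines


def dump_symbols(symbols: Dict[int, str]) -> str:
    """Serialise the symbol table so a later session can reload it."""
    return json.dumps({f"{addr:04X}": name for addr, name in symbols.items()}, indent=2)


def load_symbols(text: str) -> Dict[int, str]:
    """Inverse of dump_symbols."""
    return {int(addr, 16): name for addr, name in json.loads(text).items()}


program = [(0x0100, "JSR", 0x0200), (0x0103, "BRA", 0x0100), (0x0200, "RTS", None)]
symbols: Dict[int, str] = {}
first_pass = render(program, symbols)      # raw addresses only
symbols[0x0200] = "PRINT"                  # the user names a call target...
symbols[0x0100] = "START"                  # ...and a branch destination
second_pass = render(program, symbols)     # rerun picks both names up
```

A predefined set of OS API names is then just another dictionary merged into `symbols` before the first pass.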

--
Martin    | martin at 
Gregorie  | gregorie dot org
Reply to
Martin Gregorie

I was going to say that disassemblers for IBM seem to work fairly well. I've used them a few times.

I think, for example, that one disassembler might recognize the SVC number. I think it put the macro name in as a comment (LINK, GETMAIN, etc.)
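
Something along these lines, as a rough Python sketch. The table is a small subset written from memory, so check it against the IBM documentation before relying on it, and the output layout is made up rather than taken from any particular disassembler. Note that GETMAIN and FREEMAIN can also be issued through other SVC numbers depending on the macro form.

```python
from typing import Optional

SVC_OPCODE = 0x0A   # SVC is a two-byte instruction: opcode 0x0A, then the SVC number

# Partial table, from memory; verify against IBM documentation
SVC_NAMES = {
    4: "GETMAIN",
    5: "FREEMAIN",
    6: "LINK",
    8: "LOAD",
    13: "ABEND",
    19: "OPEN",
    20: "CLOSE",
    35: "WTO",
}


def decode_svc(halfword: bytes) -> Optional[int]:
    """Return the SVC number if these two bytes encode an SVC, else None."""
    if len(halfword) == 2 and halfword[0] == SVC_OPCODE:
        return halfword[1]
    return None


def annotate_svc(number: int) -> str:
    """Format an SVC line, appending the macro name as a comment when known."""
    name = SVC_NAMES.get(number, "")
    return f"SVC   {number:<10}{name}".rstrip()


print(annotate_svc(decode_svc(bytes([0x0A, 0x06]))))
```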

--
Pete
Reply to
Peter Flass

I did a fun side project a few years back. The source for one module of PL/I(F) was chooched on the distribution tape; about the last third was missing. I disassembled the object module, and was able to recognize variable names and standard compiler macros. I got my restored version back to identical to the original, and also a fairly readable source.

--
Pete
Reply to
Peter Flass

The pieces of the hardware supported by the Blob.

The list of binary codes that tell the processor what to do.

On the Raspberry Pi it is the non-Open-Source proprietary code that is provided by the chip manufacturer, including parts of the boot loader and the 3D drivers among other things.

Why not?

He's talking about something that you can give a pile of object code from an unknown source (I mean _really_ unknown--it could be for Z/OS or a VAX or Intel or Alpha or any other architecture, compiled from C or PL/I or Fortran or pick a language at random), with it figuring out from there what the code does.

Reply to
J. Clarke

You start with the inputs and outputs and work into the algorithms and eventually maybe you can make sense of it.

Reply to
J. Clarke
