AI and decompilation?

- Charlie Gibbs

Tue, Jan 5, 2021 8:12 PM

9x6 - if you're working in base 13.
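The arithmetic behind the joke: "42" read as a base 13 number is 4*13 + 2 = 54, which is 9*6. A minimal sketch of that check in C, assuming only the standard `strtol`; the helper name `parse_in_base` is made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: interpret a digit string in the given base (2..36)
   using the standard library's strtol. */
long parse_in_base(const char *digits, int base)
{
    assert(digits != NULL && base >= 2 && base <= 36);
    return strtol(digits, NULL, base);
}
```

Calling `parse_in_base("42", 13)` gives the same value as `9 * 6`, while `parse_in_base("42", 10)` is plain 42.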

--
/~\  Charlie Gibbs                  |  "Some of you may die, 
\ /        |  but it's a sacrifice 
 X   I'm really at ac.dekanfrus     |  I'm willing to make." 
/ \  if you read it the right way.  |    -- Lord Farquaad (Shrek)

- Charlie Gibbs

Tue, Jan 5, 2021 8:12 PM

I avoid that technique - it invites other stupid mistakes.

--
/~\  Charlie Gibbs                  |  "Some of you may die, 
\ /        |  but it's a sacrifice 
 X   I'm really at ac.dekanfrus     |  I'm willing to make." 
/ \  if you read it the right way.  |    -- Lord Farquaad (Shrek)

- Charlie Gibbs

Tue, Jan 5, 2021 8:15 PM

This message has been brought to you by the Department of Redundancy Department (just down the hall from the Department of Incomplete

--
/~\  Charlie Gibbs                  |  "Some of you may die, 
\ /        |  but it's a sacrifice 
 X   I'm really at ac.dekanfrus     |  I'm willing to make." 
/ \  if you read it the right way.  |    -- Lord Farquaad (Shrek)

- gareth evans

Tue, Jan 5, 2021 8:20 PM

+1

Why any coder would want a procname with different calling lists is beyond me.

To object to having x_procname and y_procname etc suggests a coder is not focussed on the matter in hand but is religiously adhering to some irrelevant convention.

- druck

Tue, Jan 5, 2021 8:51 PM

Modern compilers of any language output a structured executable file, such as the Portable Executable (PE) format for Windows and ELF for Linux.

That's two separate problems. The first is taking any block of binary and identifying if it contains an executable format of a particular processor architecture and OS.

The second is taking a known executable format and turning it into a human-readable form, such as a high level language, which doesn't have to be the same language it was written in.

That's a third problem. No matter how good your program is at identifying the code and producing pseudo-source code, someone still has to put in a huge amount of work to interpret and document how the driver creates certain structures in memory and pokes values into registers.
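For the first of those problems, the usual starting point with well defined formats is a magic number check. A minimal sketch in C, not taken from any particular tool: the names `guess_format` and `enum exe_format` are invented here. It relies on two documented facts: ELF files begin with the bytes 0x7f 'E' 'L' 'F', and PE files begin with a DOS "MZ" header whose 32-bit little-endian field at offset 0x3c points to a "PE\0\0" signature. It says nothing about processor architecture, and nothing about a blob in an unpublished format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum exe_format { EXE_UNKNOWN = 0, EXE_ELF, EXE_PE };

/* Hypothetical helper: guess the container format of a block of bytes
   from its magic numbers alone. */
enum exe_format guess_format(const unsigned char *buf, size_t len)
{
    static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };

    assert(buf != NULL || len == 0);

    if (len >= sizeof elf_magic &&
        memcmp(buf, elf_magic, sizeof elf_magic) == 0)
        return EXE_ELF;

    /* PE: DOS header starts with "MZ"; the little-endian offset stored
       at 0x3c locates the "PE\0\0" signature. */
    if (len >= 0x40 && buf[0] == 'M' && buf[1] == 'Z') {
        unsigned long off = (unsigned long)buf[0x3c]
                          | (unsigned long)buf[0x3d] << 8
                          | (unsigned long)buf[0x3e] << 16
                          | (unsigned long)buf[0x3f] << 24;
        if (off <= len - 4 && memcmp(buf + off, "PE\0\0", 4) == 0)
            return EXE_PE;
    }

    return EXE_UNKNOWN;
}
```

Anything that falls through to `EXE_UNKNOWN` is where the harder work starts.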

---druck

- Peter Flass

Tue, Jan 5, 2021 9:06 PM

The object code format would give you a clue, at least for most mainstream architectures.

--
Pete

- Peter Flass

Tue, Jan 5, 2021 9:06 PM

You still have to set up the arguments for each in assembler, unless they all take the same arguments (or a pointer to an argument list).

You shouldn't need declarations in C unless you're using one of those new-fangled compilers that requires them. Old code should still be supported, though.

--
Pete

- Vir Campestris

Tue, Jan 5, 2021 9:15 PM

I understand that the latest Pi is indeed a VC4.

Be aware that as a SIMD processor it's ... odd. Very odd.

That documentation ties up with what I remember about the device, and I wish I'd had it when I was working on it.

Andy

- druck

Tue, Jan 5, 2021 9:20 PM

They certainly do. I wrote !ARMalyser to analyse RISC OS executables and to aid the conversion from the old 26 bit ARM mode to modern AArch32. It was very obvious whether Norcroft C, GCC or handwritten assembly had been used by looking at any chunk of the code, not just the obvious file headers.

I was not attempting to turn the executable into a high level language, but to give the user as much help understanding the assembler code as possible, to aid the conversion.

At the lowest level, that meant identifying what was code and what was data: easy in well defined executable formats produced by compilers, but hard in handwritten assembler, which had often used every trick in the book to squeeze out performance on an 8MHz ARM2 with 512KB of RAM.

The next step was using knowledge of the Standard C Library functions and SWI APIs to annotate the registers passed to and returned from the APIs and, where those registers contain static addresses, the data blocks they point to.

To allow the code to be modified with additional instructions that recreate the flag preserving behaviour of the 26 bit code (in the few cases where it is actually necessary), and data to be added to make the larger 32 bit file headers, all code and data addresses are identified and converted into labels.

ARMalyser outputs in the standard Object Assembler syntax so it can be reassembled to produce an identical executable, and subsequently modified. It can also add syntax colouring in various formats such as XML, HTML/CSS for viewing.

If you were in marketing you could say the code which does this is 'AI', but it's really a huge chunk of tangled heuristics, which works well most of the time but occasionally misidentifies code or data. It's a bit too eager to identify code, due to the tricks assembler programmers used. If I ripped all that out and only worked on compiler generated executables, it would be a lot more reliable.

---druck

- Peter Flass

Tue, Jan 5, 2021 9:25 PM

Most architectures seem to be simpler than x86 with its mix of random instruction lengths. Start at almost any byte and a disassembler would probably be able to find a run of "instructions" that don't make any sense when examined by a human. Disassemblers I have worked with allow for human input to mark constants, for example, and allow them to be skipped.
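The effect is easy to reproduce with a toy. The sketch below is not x86: it assumes an invented three-opcode instruction set (0x01 is one byte long, 0x02 is two, 0x03 is three) and a hypothetical `toy_sweep` function that decodes linearly from a chosen start offset and counts the instructions it finds.

```c
#include <assert.h>
#include <stddef.h>

/* Invented toy ISA: the opcode byte alone fixes the instruction length. */
static size_t toy_length(unsigned char opcode)
{
    switch (opcode) {
    case 0x01: return 1;  /* "NOP"             */
    case 0x02: return 2;  /* "LOADI imm8"      */
    case 0x03: return 3;  /* "JMP imm16"       */
    default:   return 0;  /* not an opcode     */
    }
}

/* Linear sweep: decode from `start` until an invalid opcode, a truncated
   instruction, or the end of the buffer. Returns the instruction count. */
size_t toy_sweep(const unsigned char *code, size_t len, size_t start)
{
    size_t pos = start, count = 0;

    assert(code != NULL);
    while (pos < len) {
        size_t n = toy_length(code[pos]);
        if (n == 0 || pos + n > len)
            break;
        pos += n;
        count++;
    }
    return count;
}
```

For the bytes 03 03 01 01 01 01, a sweep from offset 0 finds four instructions and a sweep from offset 1 finds three. Both run cleanly to the end of the buffer, and nothing in the bytes says which reading is the intended one.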

--
Pete

- Peter Flass

Tue, Jan 5, 2021 9:25 PM

Simple in PL/I, although it turns out there is more overhead than you'd think, particularly if foo and bar have different return types. I used multiple entries extensively in the Iron Spring PL/I compiler, but it turns out the "package" construct (once I implemented it) is much cleaner. Multiple entries is also error-prone if the entries have different parameters.

--
Pete

- Martin Gregorie

Tue, Jan 5, 2021 10:23 PM

It's very useful indeed in Java: it's often helpful to use the same name with different parameter lists for constructors, and also for methods that all do similar jobs. For example, for outputting values from a class it's helpful to use the same method name with different parameter lists, say:

getValue(String caption, int value);
getValue(String caption, double value);
getValue(String caption, boolean value);

where inventing a set of different method names adds little clarity to the code, and about the only likely mistake is to try to use it with an unsupported type of value, which will cause a compilation error as you might expect.

Similarly, the ability to do something similar in Algol 68 is equally convenient and, IME anyway, doesn't make the code any more error prone or less readable.

In both languages, context selects the appropriate method or proc, and using an undefined variation gets you a compilation error, despite Algol 68's tendency to widen numeric values to fit a parameter definition.
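For comparison, C has no overloading, but C11's `_Generic` selection can give a similar effect at the call site. A minimal sketch, assuming a C11 compiler; `format_int`, `format_double` and the `format_value` macro are hypothetical names, not anything from the posts above. Only int and double are covered, since before C23 the `true` and `false` macros are plain ints and would select the int branch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-type helpers: write "caption: value" into out. */
int format_int(char *out, size_t n, const char *caption, int value)
{
    assert(out != NULL && caption != NULL);
    return snprintf(out, n, "%s: %d", caption, value);
}

int format_double(char *out, size_t n, const char *caption, double value)
{
    assert(out != NULL && caption != NULL);
    return snprintf(out, n, "%s: %.2f", caption, value);
}

/* One name, dispatched on the static type of `value`. There is no
   default branch, so an unsupported type fails to compile. */
#define format_value(out, n, caption, value) \
    _Generic((value),                        \
        int:    format_int,                  \
        double: format_double)(out, n, caption, value)
```

With a `char buf[32]`, `format_value(buf, sizeof buf, "count", 42)` goes to `format_int` and `format_value(buf, sizeof buf, "ratio", 1.5)` goes to `format_double`. The selection happens at compile time, as with the Java and Algol 68 cases.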

--
Martin    | martin at 
Gregorie  | gregorie dot org

- Martin Gregorie

Tue, Jan 5, 2021 10:27 PM

Last time I tried it (about 2 months ago), the current GNU C compiler accepted the old K&R C first edition procedure declaration syntax. I wish more compilers worked this way.

--
Martin    | martin at 
Gregorie  | gregorie dot org

- gareth evans

Tue, Jan 5, 2021 10:51 PM

Sorry but neither. I'm positing the problem of analysing binary when it does not feature in any known published format.

- gareth evans

Tue, Jan 5, 2021 10:52 PM

In the case of the RPi GPU the format is completely unknown.

- gareth evans

Tue, Jan 5, 2021 10:54 PM

You were working on it? What can you tell us?

- Charlie Gibbs

Wed, Jan 6, 2021 12:14 AM

I write functions this way:

#ifdef PROTOTYPE
char *foo(char *bar, int baz)
#else
char *foo(bar, baz)
char *bar;
int baz;
#endif

One #define in a header file adapts it to any old or new compiler. It works for declarations too.
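A sketch of how the same switch might cover a declaration as well as the definition. This is one reading of the technique, not the poster's actual code: the function body is invented for illustration, and `PROTOTYPE` is defined inline here although it would normally live in a shared header or on the compiler command line.

```c
#include <assert.h>

#define PROTOTYPE  /* assumed to come from a common header in real use */

/* Declaration: full prototype for ANSI compilers, empty parentheses
   for pre-ANSI ones. */
#ifdef PROTOTYPE
char *foo(char *bar, int baz);
#else
char *foo();
#endif

/* Definition, switched the same way. */
#ifdef PROTOTYPE
char *foo(char *bar, int baz)
#else
char *foo(bar, baz)
char *bar;
int baz;
#endif
{
    /* Hypothetical body: return a pointer baz characters into bar. */
    assert(bar != 0 && baz >= 0);
    return bar + baz;
}
```

Removing the `#define` selects the old-style forms without touching the rest of the source.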

--
/~\  Charlie Gibbs                  |  "Some of you may die, 
\ /        |  but it's a sacrifice 
 X   I'm really at ac.dekanfrus     |  I'm willing to make." 
/ \  if you read it the right way.  |    -- Lord Farquaad (Shrek)

- Charlie Gibbs

Wed, Jan 6, 2021 12:14 AM

I find it easier to just cast "value" to a consistent type. But then, I'm a hidebound C weenie...

--
/~\  Charlie Gibbs                  |  "Some of you may die, 
\ /        |  but it's a sacrifice 
 X   I'm really at ac.dekanfrus     |  I'm willing to make." 
/ \  if you read it the right way.  |    -- Lord Farquaad (Shrek)

- Eli the Bearded

Wed, Jan 6, 2021 2:17 AM

Java gets that from C++ doesn't it?

Use a union type with an enum that records which member of the union is the valid one. That can easily expand to new types that are not easily cast.

enum caption_types {
    CAPTION_UNDEF = 0,
    CAPTION_INT,
    CAPTION_FLOAT,
    CAPTION_RETURN
};

struct int_value_t;
struct float_value_t;
struct return_value_t;

typedef struct {
    enum caption_types here;
    int value;
} int_value_t;

typedef struct {
    enum caption_types here;
    float value;
} float_value_t;

/* try casting this one to something meaningful... */
typedef struct {
    enum caption_types here;
    int (*value)();
} return_value_t;

typedef struct {
    enum caption_types here;
} id_value_t;

typedef union {
    id_value_t id_value;
    int_value_t int_value;
    float_value_t float_value;
    return_value_t return_value;
} compare_value_t;

typedef char String; /* unused here */

int getCaption(String caption, compare_value_t value)
{
    int rv = -1;
    if (CAPTION_INT == value.id_value.here) {
        rv = value.int_value.value;
    }
    if (CAPTION_FLOAT == value.id_value.here) {
        rv = value.float_value.value;
    }
    if (CAPTION_RETURN == value.id_value.here) {
        /* use the return value of the function provided */
        rv = (value.return_value.value)();
    }

    return rv;
}

Elijah

------ should have used a switch statement

- Martin Gregorie

Wed, Jan 6, 2021 8:25 AM

That's safe but not necessary, for GNU C anyway.

The GNU C compiler series maintains backward compatibility to the year dot. Dunno about other brands of C compiler, though. Just as well, since I have some sources that were written under OS/9 v2.4 and so use the syntax defined in the original K&R edition, and I hate having to edit a source file just because a new compiler version dropped support for everything except the latest syntax.

That's one reason I don't like Python.

COBOL is another language that historically tended to support only the latest syntax, which is a pain since source files can be huge. I've worked on COBOL program modules that ran to over 5000 lines back in the day, i.e. before 1978, when COBOL didn't yet support writing separately compiled subroutines (no LINKAGE SECTION), though AFAIK COBOL has always supported calling subroutines written in other languages.

--
Martin    | martin at 
Gregorie  | gregorie dot org