PacoBlaze with multiply and 16-bit add/sub instructions

 Hello people.

 As I announced some days ago, I updated the PacoBlaze3 core
[http://bleyer.org/pacoblaze/], now with a wide ALU that supports an 8x8
multiply instruction ('mul') and 16-bit add/sub operations ('addw',
'addwcy', 'subw', 'subwcy'). The new extension core is called
PacoBlaze3M. It could be useful for performing small DSP functions and
math subroutines when there is a spare hardware multiplier block.

 The implementation scheme modifies the PicoBlaze register model by
dividing it into odd/even (high/low) sections with a multiplexing layer.
16-bit writes are performed on both odd/even registers. The multiply
operation accepts any two arbitrary registers and the wide add/sub
instructions operate on contiguous 16-bit "extended" registers.

 Eg: (KCAsm code)

---8<---

test_mul: ; mul example
    load s0, $ca ; s0 = 0xca
    load s2, $fe ; s2 = 0xfe
    mul s0, s2 ; = 0xca * 0xfe = 0xc86c

test_addw: ; addw example
    load s1, $ca ; s1 = 0xca ; mix cafe...
    load s0, $fe ; s0 = 0xfe

    load s3, $be ; s3 = 0xbe ; ...with beef
    load s2, $ef ; s2 = 0xef

    addw s2, s0 ; = 0xbeef + 0xcafe = 0x189ed ; yes, you got 189'ed :oP

--->8---
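For readers without the core at hand, the arithmetic in the two examples can be reproduced with a small Python model. This is only a sketch: `regs` and the helper functions are hypothetical names, and it assumes that `mul` writes its 16-bit product to the even/odd pair containing the first operand, which the post does not spell out. The pairing itself (even register = low byte, odd register = high byte) follows the description above.

```python
# Toy model of the PacoBlaze3M wide operations (illustrative only).
regs = [0] * 16  # s0..sF, 8 bits each

def pair_base(r):
    """Index of the even (low) register of the pair containing r."""
    return r & ~1

def read16(r):
    """Read the 16-bit value held in the pair containing r."""
    lo = pair_base(r)
    return (regs[lo + 1] << 8) | regs[lo]

def write16(r, value):
    """Write the low 16 bits of value to the pair containing r."""
    lo = pair_base(r)
    regs[lo] = value & 0xFF
    regs[lo + 1] = (value >> 8) & 0xFF

def mul(rx, ry):
    """8x8 -> 16 multiply; the destination pair is an assumption."""
    product = regs[rx] * regs[ry]
    write16(rx, product)
    return product

def addw(rx, ry):
    """16-bit add on register pairs; returns the carry out."""
    total = read16(rx) + read16(ry)
    write16(rx, total)
    return total >> 16

regs[0], regs[2] = 0xCA, 0xFE
print(hex(mul(0, 2)))          # 0xc86c

regs[1], regs[0] = 0xCA, 0xFE  # s1:s0 = 0xcafe
regs[3], regs[2] = 0xBE, 0xEF  # s3:s2 = 0xbeef
carry = addw(2, 0)
print(hex(read16(2)), carry)   # 0x89ed 1
```

The 17th bit of 0x189ed ends up in the carry, so the register pair itself holds 0x89ed.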

 I am having a bit of trouble intercepting the adder carry in a carry
chain with ISE using behavioral code. I am currently using two muxed
adders (one 8-bit, one 16-bit) for the addsub module instead of the
ideal high/low 8-bit adders with full and half carries. Any ideas on
how to implement this in ISE?
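The arithmetic behind the "ideal" arrangement can be sketched in Python: two chained 8-bit additions give the same result as one 16-bit addition, and the carry passed between them is the half carry. The function names are made up for illustration, and the model says nothing about how ISE maps the logic onto the carry chain.

```python
def add16_split(a, b, c_in=0):
    """Two chained 8-bit adds; returns (sum, half_carry, carry_out)."""
    lo = (a & 0xFF) + (b & 0xFF) + c_in
    half = lo >> 8                       # carry out of bit 7
    hi = (a >> 8) + (b >> 8) + half      # high byte consumes the half carry
    return ((hi & 0xFF) << 8) | (lo & 0xFF), half, hi >> 8

def add16_wide(a, b, c_in=0):
    """Single 16-bit add; returns (sum, carry_out)."""
    total = a + b + c_in
    return total & 0xFFFF, total >> 16

print(add16_split(0xBEEF, 0xCAFE))  # (35309, 1, 1), i.e. sum 0x89ed
```

Both inputs are expected to be 16-bit values. Since the split version already produces the full carry, a second, separate 16-bit adder is redundant as far as the arithmetic is concerned.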

 I will focus now on adding better documentation and some verification
scripts. I also have a small language in the works (sarKCAsm -- how
original) that is a macro assembler with operations to code in
Pico/PacoBlaze using commands like s0 = s1+s2, s4 += s5, etc. I will
release that as soon as I finish teaching myself ANTLR.

 Enjoy & rejoice ;o)

--
PabloBleyerKocik /"The danger from computers is not that they will
 pablo          / eventually get as smart as men, but that we will
Re: PacoBlaze with multiply and 16-bit add/sub instructions

Sounds impressive.
Have you seen the AS Assembler, and the Mico8 from Lattice?

FWIR the Mico8 is very similar to PicoBlaze (as expected, both are
tiny FPGA-targeted CPUs), but I think with a larger jump and call reach
(but simpler RET options).
If you are loading on features, the call lengths might need attention?

Have you tried targeting this to a Lattice device?

-jg




Re: PacoBlaze with multiply and 16-bit add/sub instructions

 Yes, I am very much aware of Mico8 and I have used AS in several
projects in the past. I know that it supports PicoBlaze (and Mico8
now). But what I want to do now is a small version of a language like
HLA or terse for PicoBlaze. Something simple and readable that is easy
to modify like the current KCAsm (hey, adding the mul and add/sub
instructions took less than one minute. ;o)

 Here is what sarKCAsm is currently looking like (currently a JavaCC
implementation, but I am swapping to ANTLR now because it has better
support for trees).

---8<---

    s0 = $ca ; load
    s1 = s0 + $fe ; same as s1 = s0, s1 += $fe
    func($be, $ef) ; function call, s0 = $be, s1 = $ef

    s3 = 16

loop:
    func(s0, s1)
    s0 == $55 ; compare
    done Z? ; conditional jump
    s3 -= 1
    done Z?
    loop ; unconditional jump

done:
    done

func(s0: s0, s1): ; result + clobber list
    s0 <- $0 ; read from port 0
    s0 ^= s1 ; xor
    s1 << C ;  sla
    # ; return

--->8---
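To make the expansion idea concrete, here is a rough Python sketch of how a few of these statements could be lowered to KCAsm-style lines. It is not the sarKCAsm implementation (the thread gives no grammar); `expand` and `OPS` are hypothetical names, only three statement shapes are handled, and the two-instruction expansion of `s1 = s0 + $fe` follows the comment in the listing above.

```python
import re

# Operator -> PicoBlaze mnemonic (lower case, as in the KCAsm examples).
OPS = {"+": "add", "-": "sub", "^": "xor", "&": "and", "|": "or"}

def expand(stmt):
    """Expand one statement into a list of assembler lines."""
    stmt = stmt.split(";")[0].strip()           # drop trailing comment
    m = re.fullmatch(r"(s[0-9a-f])\s*([-+^&|])=\s*(\S+)", stmt)
    if m:                                        # s4 += s5
        rx, op, operand = m.groups()
        return [f"{OPS[op]} {rx}, {operand}"]
    m = re.fullmatch(r"(s[0-9a-f])\s*=\s*(\S+)\s*([-+^&|])\s*(\S+)", stmt)
    if m:                                        # s1 = s0 + $fe
        rx, src, op, operand = m.groups()
        return [f"load {rx}, {src}", f"{OPS[op]} {rx}, {operand}"]
    m = re.fullmatch(r"(s[0-9a-f])\s*=\s*(\S+)", stmt)
    if m:                                        # s0 = $ca
        rx, src = m.groups()
        return [f"load {rx}, {src}"]
    raise ValueError(f"unsupported statement: {stmt!r}")

for line in ["s0 = $ca ; load", "s1 = s0 + $fe", "s4 += s5", "s0 ^= s1 ; xor"]:
    print(expand(line))
```

A real translator would also have to handle the case where the destination register appears as the second operand, since the initial `load` would overwrite it.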

 For now the limits of the PicoBlaze model have been within my needs
(IIRC, Mico8 has the same 10-bit jumps/calls as PB3 and it is very
isomorphic to it). My main drive to create PacoBlaze was to get the
most versatile processor that I could use as a peripheral controller in
my projects (e.g. motor control, bus controller, PWM generator, audio
co-processor, specifically in the JBRD of my Javabotics project,
http://bleyer.org/javabotics/). It isn't difficult to extend the memory
model of PicoBlaze using PacoBlaze, though.

 Not yet. I plan to synthesize the core using different tools that I
may have access to, but that is not on my list of priorities.

 Cheers.

--                 /"Naturally, there's got to be some
PabloBleyerKocik / limit, for I don't expect to live
 pablo          / forever, but I do intend to hang on
  @bleyer.org  / as long as possible." -- Isaac Asimov


Re: PacoBlaze with multiply and 16-bit add/sub instructions

I realised that; - just checking you knew of them :)

Good targets.



Will you also do boolean (Flag) functions ?

General comments: ( feel free to ignore... )

The expression clarity makes good sense, and I also like languages that
can accept flexible constants: viz $55 or 0x55 or 55H, or 2#01010101 or
16#55, or 2#01_0101_01.
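As an illustration of how little code such flexibility costs, here is a minimal Python sketch that accepts the forms listed above. `parse_const` is a hypothetical helper, not part of KCAsm or AS, and it assumes underscores are pure digit separators.

```python
import re

def parse_const(text):
    """Parse $55, 0x55, 55H, 16#55, 2#0101_0101 or plain decimal."""
    t = text.strip().replace("_", "")        # underscores only group digits
    if t.startswith("$"):
        return int(t[1:], 16)
    if t.lower().startswith("0x"):
        return int(t[2:], 16)
    m = re.fullmatch(r"(\d+)#([0-9A-Za-z]+)", t)
    if m:                                     # base#digits, base 2..36
        return int(m.group(2), int(m.group(1)))
    if t[-1:] in ("H", "h"):
        return int(t[:-1], 16)
    return int(t, 10)

for s in ["$55", "0x55", "55H", "2#01010101", "16#55", "2#01_0101_01"]:
    print(s, "->", parse_const(s))
```

All six spellings evaluate to 85 (0x55). Invalid input raises `ValueError` from `int`.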

I've also seen  XOR AND OR NOT etc keywords supported, as well as the
terse C equivalents. ( which are a real throwback to when source size
mattered ).

but I'm not sure about labels in the leftmost code column - that makes
code harder to scan, and indent etc, and not as clear in a
syntax-highlighted editor....

ie If you have to add a comment, then the language is probably not clear
enough....

# for return ?         => why? - why not return, or RET or IFnZ RET
label then condition ? => most languages are IF_Z THEN or if_nZ DestAddr
Label for Loop jmp ?   => REPEAT Label, or LOOP label

If a 12yr old kid can read the source, and not need a raft of prior
knowledge, then that's a good test of any language :)

-jg





Re: PacoBlaze with multiply and 16-bit add/sub instructions

  I think I recall the Mico8 had more obvious expansion space in the
opcodes - but either way, this is the sort of expansion that is nice to
allow for early-on.

  With more smarts, users _are_ going to need larger address space :)

  The assembler should accept either size, and warn on the
smaller/larger ceiling, based on a target/build family define.
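A rough Python sketch of that check, under stated assumptions: the `TARGETS` table and `check_address` are hypothetical names, the 10-bit entry follows the PB3 figure quoted earlier in the thread, and the 12-bit entry is a made-up extended variant.

```python
import warnings

# Address width per build target. "pb3" follows the 10-bit figure quoted
# in this thread; "pb3-wide" is a made-up extended variant.
TARGETS = {"pb3": 10, "pb3-wide": 12}

def check_address(addr, target="pb3"):
    """Return addr, warning when it exceeds the target's address space."""
    limit = 1 << TARGETS[target]
    if not 0 <= addr < limit:
        warnings.warn(
            f"address {addr:#x} outside {target} range 0..{limit - 1:#x}"
        )
    return addr

check_address(0x3FF, "pb3")       # fits, no warning
check_address(0x400, "pb3-wide")  # fits the wider target, no warning
```

Calling `check_address(0x400, "pb3")` would emit a warning, because a 10-bit address space ends at 0x3ff.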

-jg


Re: PacoBlaze with multiply and 16-bit add/sub instructions

Cool, though I have not had time to even get 2.0 running yet.. (
life got in the way of fun stuff )

Re: PacoBlaze with multiply and 16-bit add/sub instructions

 I think you'll need to code dummy bits at the middle and top of
the adder to pull out the carries.

 Here are some old posts with structural (Xilinx) and RTL versions:
    http://groups.google.com/group/comp.lang.vhdl/msg/51a0b827ee12f69c
    http://groups.google.com/group/comp.lang.vhdl/msg/15ad1c9b11079f7a
    http://groups.google.com/group/comp.arch.fpga/msg/622e2173af20bb16

 IIRC, using two dummy bits at the top ( '0' & copy_of_sign_bit ) makes
coding synthesizable RTL signed/unsigned carry/borrow/overflow flags
easy to implement, but quickly googling didn't turn up the post that I
recall which explained that technique.

Brian


Re: PacoBlaze with multiply and 16-bit add/sub instructions

 Thanks for the pointers. I will try that.

 Cheers.


Re: PacoBlaze with multiply and 16-bit add/sub instructions
 I couldn't turn up that other post that I recalled, but I dug
up a code snippet of the conditional signed skips of my own
homebrew processor. ( no mid-chain split, but overflow logic
coded with pad bits )

 Basically, the copy of the MSB input bits at bit position
MSB+1 lets you indirectly look for a difference in the carries
into and out of the MSB position in the inferred RTL adder.

gen_sgbt: if CFG_SKIP_GROUP_B = TRUE generate

  skip_b: block
     signal wide_diff : std_logic_vector( ALU_MSB+2 downto 0);
     signal pad_ar   : std_logic_vector( ALU_MSB+2 downto 0);
     signal pad_br   : std_logic_vector( ALU_MSB+2 downto 0);

     begin
       pad_ar <= ( '0' & ar(ALU_MSB) & ar );
       pad_br <= ( '0' & br(ALU_MSB) & br );

       wide_diff <= pad_ar - pad_br;

       -- sign, carry, overflow, zero bits
       cb_n <= wide_diff(ALU_MSB);
       cb_c <= wide_diff(ALU_MSB+2);
       cb_v <= wide_diff(ALU_MSB+1) XOR wide_diff(ALU_MSB);
       cb_z <= '1' when ( wide_diff(ALU_MSB downto 0) = ALU_ZERO )
               else '0';

     end block skip_b;

   --
   -- mux for skip_b condition decoding
   --
   skip_decode_b: process(skip_sense, skip_type, cb_z, cb_n, cb_c,
                          cb_v)
    variable skip_mux_b : std_logic;
    begin

      -- mux condition sources
      case skip_type is
        when CND_LO => skip_mux_b :=  cb_c;
        when CND_LS => skip_mux_b :=  cb_z OR  cb_c;
        when CND_LT => skip_mux_b :=  cb_n XOR cb_v;
        when CND_LE => skip_mux_b := (cb_n XOR cb_v) OR cb_z;
        when others => skip_mux_b := '1';
      end case;

      if skip_sense = '0' then
        skip_cond_b <= skip_mux_b;
      else
        skip_cond_b <= NOT skip_mux_b;
      end if;

    end process skip_decode_b;

end generate gen_sgbt;


Which implements:

SCCB   : skip conditions, group B

       000 0  skip.lo   lower                  unsigned, RA <  RB
       100 0  skip.hs   higher or same         unsigned, RA >= RB

       001 0  skip.ls   lower or same          unsigned, RA <= RB
       101 0  skip.hi   higher                 unsigned, RA >  RB

       010 0  skip.lt   less than              signed,   RA <  RB
       110 0  skip.ge   greater than or equal  signed,   RA >= RB

       011 0  skip.le   less than or equal     signed,   RA <= RB
       111 0  skip.gt   greater than           signed,   RA >  RB

 There's also a great explanation of generating conditionals and
overflows in sections 2-11 through 2-13 of "Hacker's Delight", Warren,
Addison Wesley, 2003
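The pad-bit trick is easy to check in software. The following Python sketch models the padded subtraction from the VHDL above for an 8-bit ALU (an assumed width; `padded_sub_flags` is a hypothetical name) and derives the same four flags.

```python
def padded_sub_flags(a, b, width=8):
    """Model of wide_diff = ('0' & msb & a) - ('0' & msb & b).

    Returns (n, c, v, z) as 0/1 integers.
    """
    msb = width - 1
    mask = (1 << width) - 1

    def pad(x):
        # Copy the sign bit to position MSB+1; bit MSB+2 stays zero.
        return (((x >> msb) & 1) << width) | (x & mask)

    wide = (pad(a) - pad(b)) & ((1 << (width + 2)) - 1)
    n = (wide >> msb) & 1                           # sign of the result
    c = (wide >> (msb + 2)) & 1                     # unsigned borrow
    v = ((wide >> (msb + 1)) ^ (wide >> msb)) & 1   # signed overflow
    z = int((wide & mask) == 0)                     # zero
    return n, c, v, z

print(padded_sub_flags(0x01, 0x80))  # (1, 1, 1, 0)
```

For 0x01 versus 0x80 the borrow is set (1 < 128 unsigned), while n XOR v is 0 (1 is not less than -128 signed), matching the skip.lo and skip.lt rows of the table.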

Brian


Re: PacoBlaze with multiply and 16-bit add/sub instructions
 Brian, with the following exploded setup I could finally instruct ISE
to merge two 8-bit adders to create a 16-bit one and multiplex out the
carry to get the half-carry. I don't know why my previous setups
failed... It saves 4 slices in a SP3 instead of having two separate
adders (the output mux is not considered) and the report indeed shows
that the fanout of the half MUXCY is 2.

 Regards.

---8<---

`define WIDTH 16

/* Two chained half-width (8-bit) adders forming one full-width adder */
module addsub1(
    op, oc, y, a, b, c_in,
    c_out, h_out
);

input op, oc; // op: 0 = add, 1 = sub; oc: 1 = use c_in, 0 = ignore it
output [`WIDTH-1:0] y;
input [`WIDTH-1:0] a, b;
input c_in;
output c_out;
output h_out;

wire [`WIDTH/2-1:0] yh, yl;

wire [`WIDTH/2-1:0]
    ah = a[`WIDTH-1:`WIDTH/2], al = a[`WIDTH/2-1:0];
wire [`WIDTH/2-1:0] bh, bl;
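// Carry into bit 0. A plain subtract (oc = 0) needs +1 to complete the
// two's complement of b; with oc = 1 a subtract uses the inverted borrow.
wire c =
    (!oc) ? op :
    (op) ? ~c_in : c_in;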
wire d, e;

assign bh = (op) ? ~b[`WIDTH-1:`WIDTH/2] : b[`WIDTH-1:`WIDTH/2];
assign bl = (op) ? ~b[`WIDTH/2-1:0] : b[`WIDTH/2-1:0];

assign {d, yl} = al + bl + c;
assign {e, yh} = ah + bh + d;

assign h_out = (op) ? ~d : d;
assign c_out = (op) ? ~e : e;

assign y = {yh, yl};

endmodule

/*
=========================================================================
*                           HDL Synthesis                               *
=========================================================================

Synthesizing Unit <addsub1>.
    Related source file is "C:/src/pacoblaze/pacoblaze/addsub.v".
    Found 8-bit adder carry in/out for signal <$n0001>.
    Found 8-bit adder carry in/out for signal <$n0002>.
    Found 1-bit 4-to-1 multiplexer for signal <c>.
    Summary:
    inferred   2 Adder/Subtractor(s).
    inferred   1 Multiplexer(s).
Unit <addsub1> synthesized.


=========================================================================
HDL Synthesis Report

Macro Statistics
# Adders/Subtractors                                   : 2
 8-bit adder carry in/out                              : 2
# Multiplexers                                         : 1
 1-bit 4-to-1 multiplexer                              : 1

=========================================================================

=========================================================================
*                       Advanced HDL Synthesis                          *
=========================================================================


=========================================================================
Advanced HDL Synthesis Report

Macro Statistics
# Adders/Subtractors                                   : 2
 8-bit adder carry in/out                              : 2
# Multiplexers                                         : 1
 1-bit 4-to-1 multiplexer                              : 1

=========================================================================

=========================================================================
*                         Low Level Synthesis                           *
=========================================================================
Loading device for application Rf_Device from file '3s200.nph' in environment C:\Xilinx.

Optimizing unit <addsub1> ...

Mapping all equations...
Building and optimizing final netlist ...
Found area constraint ratio of 100 (+ 5) on block addsub1, actual ratio is 0.

=========================================================================
*                            Final Report                               *
=========================================================================
Final Results
RTL Top Level Output File Name     : addsub1.ngr
Top Level Output File Name         : addsub1
Output Format                      : NGC
Optimization Goal                  : Speed
Keep Hierarchy                     : NO

Design Statistics
# IOs                              : 53

Cell Usage :
# BELS                             : 67
#      LUT2                        : 34
#      LUT3                        : 1
#      MUXCY                       : 16
#      XORCY                       : 16
# IO Buffers                       : 53
#      IBUF                        : 35
#      OBUF                        : 18
=========================================================================

Device utilization summary:
---------------------------

Selected Device : 3s200pq208-5

 Number of Slices:                      19  out of   1920     0%
 Number of 4 input LUTs:                35  out of   3840     0%
 Number of bonded IOBs:                 53  out of    141    37%


=========================================================================
TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.
      FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT
      GENERATED AFTER PLACE-and-ROUTE.

Clock Information:
------------------
No clock signals found in this design

Timing Summary:
---------------
Speed Grade: -5

   Minimum period: No path found
   Minimum input arrival time before clock: No path found
   Maximum output required time after clock: No path found
   Maximum combinational path delay: 12.573ns

Timing Detail:
--------------
All values displayed in nanoseconds (ns)

=========================================================================
Timing constraint: Default path analysis
  Total number of paths / destination ports: 824 / 18
-------------------------------------------------------------------------
Delay:               12.573ns (Levels of Logic = 21)
  Source:            op (PAD)
  Destination:       c_out (PAD)

  Data Path: op to c_out
                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
    ----------------------------------------  ------------
     IBUF:I->O            19   0.715   1.403  op_IBUF (op_IBUF)
     LUT2:I1->O            1   0.479   0.976  bl<0>1 (bl<0>)
     LUT2:I0->O            1   0.479   0.000  addsub1_yl<0>lut (N4)
     MUXCY:S->O            1   0.435   0.000  addsub1_yl<0>cy (addsub1_yl<0>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yl<1>cy (addsub1_yl<1>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yl<2>cy (addsub1_yl<2>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yl<3>cy (addsub1_yl<3>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yl<4>cy (addsub1_yl<4>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yl<5>cy (addsub1_yl<5>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yl<6>cy (addsub1_yl<6>_cyo)
     MUXCY:CI->O           2   0.056   0.000  addsub1_yl<7>cy (d)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yh<0>cy (addsub1_yh<0>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yh<1>cy (addsub1_yh<1>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yh<2>cy (addsub1_yh<2>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yh<3>cy (addsub1_yh<3>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yh<4>cy (addsub1_yh<4>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yh<5>cy (addsub1_yh<5>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub1_yh<6>cy (addsub1_yh<6>_cyo)
     MUXCY:CI->O           1   0.265   0.976  addsub1_yh<7>cy (e)
     LUT2:I0->O            1   0.479   0.681  c_out1 (c_out_OBUF)
     OBUF:I->O                 4.909          c_out_OBUF (c_out)
    ----------------------------------------
    Total                     12.573ns (8.538ns logic, 4.035ns route)
                                       (67.9% logic, 32.1% route)
*/


/* Two separate adders: 8-bit for the half carry, 16-bit for the result */
module addsub2(
    op, oc, y, yl, a, b, c_in,
    c_out, h_out
);

input op, oc; // op: 0 = add, 1 = sub; oc: 1 = use c_in, 0 = ignore it
output [`WIDTH-1:0] y;
input [`WIDTH-1:0] a, b;
input c_in;
output c_out;
output h_out;

output [`WIDTH/2-1:0] yl;

wire [`WIDTH/2-1:0]
    al = a[`WIDTH/2-1:0];
wire [`WIDTH-1:0] bs;
wire [`WIDTH/2-1:0] bl;
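// Carry into bit 0. A plain subtract (oc = 0) needs +1 to complete the
// two's complement of b; with oc = 1 a subtract uses the inverted borrow.
wire c =
    (!oc) ? op :
    (op) ? ~c_in : c_in;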
wire d, e;

assign bl = (op) ? ~b[`WIDTH/2-1:0] : b[`WIDTH/2-1:0];
assign bs = (op) ? ~b : b;

assign {d, yl} = al + bl + c;
assign {e, y} = a + bs + c;

assign h_out = (op) ? ~d : d;
assign c_out = (op) ? ~e : e;

endmodule


/*
=========================================================================
*                           HDL Synthesis                               *
=========================================================================

Synthesizing Unit <addsub2>.
    Related source file is "C:/src/pacoblaze/pacoblaze/addsub.v".
    Found 16-bit adder carry in/out for signal <$n0001>.
    Found 8-bit adder carry in/out for signal <$n0002>.
    Found 1-bit 4-to-1 multiplexer for signal <c>.
    Summary:
    inferred   2 Adder/Subtractor(s).
    inferred   1 Multiplexer(s).
Unit <addsub2> synthesized.


=========================================================================
HDL Synthesis Report

Macro Statistics
# Adders/Subtractors                                   : 2
 16-bit adder carry in/out                             : 1
 8-bit adder carry in/out                              : 1
# Multiplexers                                         : 1
 1-bit 4-to-1 multiplexer                              : 1

=========================================================================

=========================================================================
*                       Advanced HDL Synthesis                          *
=========================================================================


=========================================================================
Advanced HDL Synthesis Report

Macro Statistics
# Adders/Subtractors                                   : 2
 16-bit adder carry in/out                             : 1
 8-bit adder carry in/out                              : 1
# Multiplexers                                         : 1
 1-bit 4-to-1 multiplexer                              : 1

=========================================================================

=========================================================================
*                         Low Level Synthesis                           *
=========================================================================
Loading device for application Rf_Device from file '3s200.nph' in environment C:\Xilinx.

Optimizing unit <addsub2> ...

Mapping all equations...
Building and optimizing final netlist ...
Found area constraint ratio of 100 (+ 5) on block addsub2, actual ratio is 1.

=========================================================================
*                            Final Report                               *
=========================================================================
Final Results
RTL Top Level Output File Name     : addsub2.ngr
Top Level Output File Name         : addsub2
Output Format                      : NGC
Optimization Goal                  : Speed
Keep Hierarchy                     : NO

Design Statistics
# IOs                              : 61

Cell Usage :
# BELS                             : 91
#      LUT2                        : 34
#      LUT3                        : 9
#      MUXCY                       : 24
#      XORCY                       : 24
# IO Buffers                       : 61
#      IBUF                        : 35
#      OBUF                        : 26
=========================================================================

Device utilization summary:
---------------------------

Selected Device : 3s200pq208-5

 Number of Slices:                      23  out of   1920     1%
 Number of 4 input LUTs:                43  out of   3840     1%
 Number of bonded IOBs:                 61  out of    141    43%


=========================================================================
TIMING REPORT

NOTE: THESE TIMING NUMBERS ARE ONLY A SYNTHESIS ESTIMATE.
      FOR ACCURATE TIMING INFORMATION PLEASE REFER TO THE TRACE REPORT
      GENERATED AFTER PLACE-and-ROUTE.

Clock Information:
------------------
No clock signals found in this design

Timing Summary:
---------------
Speed Grade: -5

   Minimum period: No path found
   Minimum input arrival time before clock: No path found
   Maximum output required time after clock: No path found
   Maximum combinational path delay: 12.955ns

Timing Detail:
--------------
All values displayed in nanoseconds (ns)

=========================================================================
Timing constraint: Default path analysis
  Total number of paths / destination ports: 1012 / 26
-------------------------------------------------------------------------
Delay:               12.955ns (Levels of Logic = 21)
  Source:            op (PAD)
  Destination:       c_out (PAD)

  Data Path: op to c_out
                                Gate     Net
    Cell:in->out      fanout   Delay   Delay  Logical Name (Net Name)
    ----------------------------------------  ------------
     IBUF:I->O            27   0.715   1.721  op_IBUF (op_IBUF)
     LUT2:I1->O            2   0.479   1.040  bs<0>1 (bs<0>)
     LUT2:I0->O            1   0.479   0.000  addsub2_y<0>lut (N4)
     MUXCY:S->O            1   0.435   0.000  addsub2_y<0>cy (addsub2_y<0>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<1>cy (addsub2_y<1>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<2>cy (addsub2_y<2>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<3>cy (addsub2_y<3>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<4>cy (addsub2_y<4>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<5>cy (addsub2_y<5>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<6>cy (addsub2_y<6>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<7>cy (addsub2_y<7>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<8>cy (addsub2_y<8>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<9>cy (addsub2_y<9>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<10>cy (addsub2_y<10>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<11>cy (addsub2_y<11>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<12>cy (addsub2_y<12>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<13>cy (addsub2_y<13>_cyo)
     MUXCY:CI->O           1   0.056   0.000  addsub2_y<14>cy (addsub2_y<14>_cyo)
     MUXCY:CI->O           1   0.265   0.976  addsub2_y<15>cy (e)
     LUT2:I0->O            1   0.479   0.681  c_out1 (c_out_OBUF)
     OBUF:I->O                 4.909          c_out_OBUF (c_out)
    ----------------------------------------
    Total                     12.955ns (8.538ns logic, 4.418ns route)
                                       (65.9% logic, 34.1% route)
*/

--
PabloBleyerKocik /
 pablo          /"It is a terrible thing to see and have no vision."