

# **Application Note**



# Compact Zynq System 2017.4 with SW-defined Floating-Point 8xSIMD EdkDSP Accelerator

Supported Trenz Electronic Modules: TE0720-03-2IF, TE0720-03-1QF, TE0720-03-14S-1C

Supported Trenz Electronic Carrier Boards: TE0703-05, TE0706-02

> Jiří Kadlec, Zdeněk Pohl, Lukáš Kohout kadlec@utia.cas.cz, xpohl@utia.cas.cz, kohoutl@utia.cas.cz phone: +420 2 6605 2216 UTIA AV CR, v.v.i.

#### Revision history:

| Rev. | Date       | Author      | Description |
|------|------------|-------------|-------------|
| 1    | 12.01.2018 | Jiří Kadlec | Initial internal draft for the Productive 4.0 consortium meeting 17-18.1.2018 (Lisbon, PT). For Vivado and SDK ver. 2017.1 |
| 2    | 30.01.2018 | Jiří Kadlec | Demonstrator description prior to the Productive 4.0 project conference in Athens 6-7.3.2018. For Vivado and SDK ver. 2017.1. |
| 3    | 15.05.2018 | Jiří Kadlec | Revision for Vivado and SDK ver. 2017.4.1. Supported Trenz Electronic Zynq modules: TE0720-03-2IF, TE0720-03-1QF, TE0720-03-14S-1C. Supported Trenz Electronic Carrier Boards: TE0703-05, TE0706-02 |

#### Acknowledgements:

This work has been partially supported by ECSEL JU project Productive 4.0 No. 737459.

# **Table of Contents**

- 1. EdkDSP IP Core - Introduction
- 2. Implementation Details
- 3. EdkDSP IP Core - PicoBlaze6 C Application Interface Functions
- 4. EdkDSP IP Core - MicroBlaze C Application Interface Functions
- 5. EdkDSP IP Core - Integration with dual core ARM A9 Linux
- 6. Setup of Hardware
- 7. Reference Application for the 8xSIMD EdkDSP IP Core
- 8. Installation and Use of Base Evaluation Package
- 9. Installation and Use of Debug Evaluation Package
- 10. References
- 11. Base Release Evaluation Package
- 12. Extended Debug Evaluation Package for PRODUCTIVE 4.0 partners
- Disclaimer



# Table of Figures

- Figure 1: TE0703-05 carrier board with TE0720-03-14S-1C Zynq module
- Figure 2: SDSoC compatible Zynq system with 8xSIMD EdkDSP floating point accelerator
- Figure 3: 8xSIMD EdkDSP floating point accelerator IP core with System ILA
- Figure 4: Internal details of (8xSIMD) EdkDSP floating point accelerator IP core
- Figure 5: TE0706-02; TE0720-03-14S-1C; USBUART and XMOD FTDI JTAG adapter
- Figure 6: TE0706-02 Carrier Board
- Figure 7: TE0703-05 Carrier Board
- Figure 8: Release demo t01_s. ARM and 8xSIMD EdkDSP terminal output
- Figure 9: Release demo t01_s. Vivado Lab Tool is open
- Figure 10: Release demo t01_s. Probes file is specified. Trigger conditions are set
- Figure 11: Release demo t01_s. Details of the 8xSIMD EdkDSP LMS filter computation
- Figure 12: Release demo t01_s. Details of the 8xSIMD EdkDSP FIR filter computation
- Figure 13: Release demo t01_s. Standalone demo supports measurements of the chip temperature
- Figure 14: Release demo t01_l. Linux start
- Figure 15: Release demo t01_l; Login, Compilation of firmware in the EdkDSP C Compiler
- Figure 16: Release demo t01_l; Program and start 8xSIMD EdkDSP demo
- Figure 17: Create new SDK 2017.4.1 workspace
- Figure 18: Import the extended debug evaluation package projects into the SDK Workspace
- Figure 19: SDK compiles MicroBlaze SW projects for the standalone debug target
- Figure 20: Debug demo t01_l; Execution of the ./t01_s.elf example from the SD card
- Figure 21: Debug demo t01_s; Open project edkdsp_fp12_1x8_s for debug
- Figure 22: Debug demo t01_s; Start the free-run from the debugger
- Figure 23: Debug demo t01_s. Arm started EdkDSP and runs SDSoC accelerator demo
- Figure 24: Debug demo t01_s; MicroBlaze project output (Compiled for Debug)
- Figure 25: Compiled EdkDSP firmware. Started debug demo - Linux target t01_l
- Figure 26: Select MicroBlaze project edkdsp_fp12_1x8_l for debug
- Figure 27: Select free run of MicroBlaze project edkdsp_fp12_1x8_l
- Figure 28: Output from ARM and MicroBlaze for t01_l. Compiled EdkDSP firmware
- Figure 29: Create BOOT.bin for the t01_s demo
- Figure 30: Create BOOT.bin for the t01_l demo

# Table of Tables

- Table 1: Parameters of supported Zynq modules
- Table 2: (8xSIMD) EdkDSP bce_fp12_1x8_40 accelerator vector operations
- Table 3: PicoBlaze6 ports forming VLIW instruction for the 8xSIMD EdkDSP data flow unit
- Table 4: PicoBlaze6 precompiled support functions
- Table 5: MicroBlaze access names to 8xSIMD EdkDSP memory banks
- Table 6: MicroBlaze WAL error codes
- Table 7: MicroBlaze API functions for communication with 8xSIMD EdkDSP IP core
- Table 8: Organisation of DDR3 memory
- Table 9: Connection of USBUART to TE0706-02
- Table 10: Connection of USBUART to TE0703-05
- Table 11: Requirements and results
- Table 12: Description of ARM SDSoC acceleration examples compatible with 8xSIMD EdkDSP IP



# 1. EdkDSP IP Core - Introduction

This report describes the design of a compact HW system based on the Zynq all programmable 28nm chip with one or two Arm A9 processors and a programmable logic area. The system is optimised for Ethernet connected computing nodes serving industrial automation, local data processing and data communication. The documented HW architecture is one of the candidates for wider use within the ECSEL Productive 4.0 project as the edge computing node in Industry 4.0 solutions. Two carrier boards and three Zynq modules from Trenz Electronic are supported.

The demonstrated Zynq systems include the run-time reprogrammable 8xSIMD EdkDSP IP core. It combines the MicroBlaze processor and the floating point single instruction multiple data (SIMD) data flow unit (DFU). The SIMD DFU is controlled by a run-time reprogrammable finite state machine implemented by the Xilinx PicoBlaze6 8 bit controller, with a dedicated embedded C compiler executed on the Zynq.

The application note describes the installation of the HW system, the SW API, the algorithmic implementation and the mapping to the 8xSIMD EdkDSP IP. The presented HW system is also compatible with the Xilinx SDSoC 2017.4.1 design environment. SDSoC supports automated compilation of user-defined C/C++ ARM functions into HW accelerators with several types of data movers (zero-copy, DMA, SG-DMA) and the automated integration of the generated accelerators into an ARM Linux or standalone application.
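As an illustration of the SDSoC flow mentioned above, the following minimal sketch shows how a plain C function might be annotated so that SDSoC selects a DMA data mover for its arguments. The function name `vadd_hw` and the size `N` are hypothetical and are not taken from the evaluation package; only the `#pragma SDS data` directives are SDSoC syntax, and other compilers ignore them (possibly with a warning).

```c
#define N 256  /* hypothetical vector length, chosen for illustration */

/* Hypothetical candidate for HW acceleration in SDSoC.
 * access_pattern: the arrays are read and written as sequential streams.
 * data_mover:     request the simple AXI DMA for each argument.           */
#pragma SDS data access_pattern(a:SEQUENTIAL, b:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data data_mover(a:AXIDMA_SIMPLE, b:AXIDMA_SIMPLE, out:AXIDMA_SIMPLE)
void vadd_hw(const float a[N], const float b[N], float out[N])
{
    for (int i = 0; i < N; ++i)
        out[i] = a[i] + b[i];  /* element-wise floating point add */
}
```

When built with an ordinary C compiler the function simply runs on the processor, which is convenient for checking the reference behaviour before the function is marked for HW in SDSoC. The simple DMA data mover expects physically contiguous buffers, so on the target the arrays would typically be allocated with the SDSoC allocator rather than on the stack.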



Figure 1: TE0703-05 carrier board with TE0720-03-14S-1C Zynq module



# 2. Implementation Details



Figure 2: SDSoC compatible Zynq system with 8xSIMD EdkDSP floating point accelerator.

#### **Evaluation system parameters**

The evaluation system supports two Trenz Electronic carrier boards (**TE0703-05** and **TE0706-02**) [3] and three types of Trenz Electronic Zynq modules [1]:

- **TE0720-03-2IF** is an industrial grade (Tj = -40°C to +100°C) module, speed grade 2, with dual core Arm A9. The dual core Arm A9 and the PL part are **faster** than in the other two modules.
- **TE0720-03-1QF** is an automotive grade (Tj = -40°C to +125°C) module, speed grade 1, with dual core Arm A9. This module can be used in applications requiring a **wide temperature range**. The module is more expensive.
- **TE0720-03-14S-1C** is a commercial grade (Tj = 0°C to +85°C) module, speed grade 1, with a **single core** Arm Cortex A9 and reduced programmable logic (PL) size. This is a **low cost module** suitable for cost sensitive applications.

Main parameters of these modules are summarised in Table 1.

*Table 1: Parameters of supported Zynq modules.*

| Module           | Xilinx Zynq device | ARM A9 | A9 clock | Slices | LUTs  | REGs   | BRAMs | DSPs |
|------------------|--------------------|--------|----------|--------|-------|--------|-------|------|
| TE0720-03-2IF    | XC7Z020-2CLG484I   | 2x     | 766MHz   | 13300  | 53200 | 106400 | 140   | 220  |
| TE0720-03-1QF    | XA7Z020-1CLG484Q   | 2x     | 666MHz   | 13300  | 53200 | 106400 | 140   | 220  |
| TE0720-03-14S-1C | XC7Z014S-1CLG484C  | 1x     | 666MHz   | 13300  | 40600 | 81200  | 107   | 170  |




The PL part of the 28nm Zynq device contains:

- The run-time reprogrammable 8xSIMD EdkDSP floating point IP core. It uses a 120 MHz clock on the faster TE0720-03-2IF module and a 100 MHz clock on the other two modules.
- A MicroBlaze 32 bit soft core processor operating at 100 MHz.
- One of the HW accelerators generated in Xilinx SDSoC 2017.4.1 from a C/C++ reference SW ARM A9 function, operating with a 150 MHz, 120 MHz, 100 MHz or 50 MHz clock.

The EdkDSP IP core is an 8xSIMD floating point accelerator. It is reprogrammable at run time by changing the firmware of a PicoBlaze6 8 bit controller. The PicoBlaze6 controller schedules the vector operations performed in the 8xSIMD floating point data paths and serves as a reprogrammable finite state machine (FSM). It is programmed by firmware compiled by the EdkDSP C Compiler and Assembler.

The EdkDSP C Compiler and Assembler are implemented as application programs running on the embedded PetaLinux 2017.4.1 operating system. The 8xSIMD EdkDSP IP is controlled by the 32bit MicroBlaze processor.

The MicroBlaze runs programs from the DDR3 memory. The DDR3 is accessed through an instruction and data cache (32k x 32bit) with the HP0 AXI interface.

The 8xSIMD EdkDSP IP is connected to the MicroBlaze by local dual-ported memories. The MicroBlaze implements the data communication from DDR3 to the 8xSIMD EdkDSP dual-ported memories in software. This communication is performed in parallel with the 8xSIMD floating point computation in the 8xSIMD EdkDSP IP.

#### Parameters of the 8xSIMD EdkDSP IP core

The 8xSIMD EdkDSP floating point accelerator IP core supports 8xSIMD vector floating point operations performed from/to the dual-ported BRAMs A, B and Z. Each dual-ported BRAM has 8 parallel layers of 1024 32 bit words. The set of supported floating point operations differs between the grades [10|20|30|40] of the 8xSIMD EdkDSP accelerator IPs. The supported floating point operations are summarised in Table 2.

- The accelerator **bce\_fp12\_1x8\_0\_axiw\_v1\_10** is area **optimized** and supports only data transfers and vector floating point operations FPADD, FPSUB in 8 SIMD data paths.
- The accelerator **bce\_fp12\_1x8\_0\_axiw\_v1\_20** performs identical operations as bce\_fp12\_1x8\_0\_axiw\_v1\_10 plus the vector floating point MAC operations in 8 SIMD data paths. MAC is supported for length of vectors 1 up to 10. This accelerator is optimized for applications like floating point matrix multiplication with one row and column dimensions <= 10.
- The accelerator **bce\_fp12\_1x8\_0\_axiw\_v1\_30** supports identical operations as bce\_fp12\_1x8\_0\_axiw\_v1\_20 plus HW-accelerated computation of the floating point vector-by-vector dot-product operators performed in 8 SIMD data paths. It is optimized for parallel computation of up to 8 FIR or LMS filters, each with size up to 250 coefficients. It is also efficient in case of floating point matrix by matrix multiplications, where one of the dimensions is large (in the range from 11 to 250).
- The accelerator **bce\_fp12\_1x8\_0\_axiw\_v1\_40** supports identical operations as bce\_fp12\_1x8\_0\_axiw\_v1\_30 plus an additional HW support of dot product. It is computed in 8 data paths with HW-supported wind-up into single scalar result propagated into all SIMD planes.

All **bce\_fp12\_1x8\_0\_axiw\_v1\_[10|20|30|40]** accelerators support a single data path for pipelined floating-point division operations. The vector operands are taken from the first SIMD plane and the result is propagated into all 8 SIMD planes.

All **bce\_fp12\_1x8\_0\_axiw\_v1\_[10|20|30|40]** accelerators are suitable for applications like adaptive normalised LMS and NLMS filters and square root free versions of adaptive RLS QR filters and adaptive RLS LATTICE filters.
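To make the filter use case concrete, here is a minimal scalar reference of one textbook LMS update step in plain C. It is a sketch for orientation only, not the EdkDSP firmware or its API: the function name `lms_step` and its parameters are hypothetical, and the comments only point at the Table 2 operation families that each loop resembles.

```c
#include <stddef.h>

/* One textbook LMS update for an n-tap filter (hypothetical reference model).
 *   w  : filter coefficients, updated in place
 *   x  : the n most recent input samples
 *   d  : desired output sample
 *   mu : adaptation step size
 * Returns the a-priori error e = d - w'x.                                    */
float lms_step(float *w, const float *x, size_t n, float d, float mu)
{
    float y = 0.0f;
    for (size_t i = 0; i < n; ++i)   /* dot product, similar in form to VPROD */
        y += w[i] * x[i];

    float e = d - y;                 /* scalar error */
    float g = mu * e;                /* common gain for the update */

    for (size_t i = 0; i < n; ++i)   /* multiply-accumulate, similar to VMAC */
        w[i] += g * x[i];

    return e;
}
```

On the accelerator, up to 8 such filters could be processed side by side, one per SIMD data path; how the actual firmware schedules these steps is defined by the PicoBlaze6 program and is not shown here.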



Table 2: (8xSIMD) EdkDSP bce\_fp12\_1x8\_40 accelerator vector operations.

| Name in MicroBlaze C  | Value (dec) | 8xSIMD floating point operation |
|-----------------------|-------------|---------------------------------|
| WAL_BCE_JK_VVER       | 0  | Return capabilities of the (8xSIMD) EdkDSP accelerator |
| WAL_BCE_JK_VZ2A       | 1  | 8xSIMD copy $a_m[i] \leftarrow z_m[j]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VB2A       | 2  | 8xSIMD copy $a_m[i] \leftarrow b_m[j]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VZ2B       | 3  | 8xSIMD copy $b_m[i] \leftarrow z_m[j]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VA2B       | 4  | 8xSIMD copy $b_m[i] \leftarrow a_m[j]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VADD       | 5  | 8xSIMD add $z_m[i] \leftarrow a_m[j] + b_m[k]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VADD_BZ2A  | 6  | 8xSIMD add $a_m[i] \leftarrow b_m[j] + z_m[k]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VADD_AZ2B  | 7  | 8xSIMD add $b_m[i] \leftarrow a_m[j] + z_m[k]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VSUB       | 8  | 8xSIMD sub $z_m[i] \leftarrow a_m[j] - b_m[k]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VSUB_BZ2A  | 9  | 8xSIMD sub $a_m[i] \leftarrow b_m[j] - z_m[k]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VSUB_AZ2B  | 10 | 8xSIMD sub $b_m[i] \leftarrow a_m[j] - z_m[k]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VMULT      | 11 | 8xSIMD mult $z_m[i] \leftarrow a_m[j] * b_m[k]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VMULT_BZ2A | 12 | 8xSIMD mult $a_m[i] \leftarrow b_m[j] * z_m[k]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VMULT_AZ2B | 13 | 8xSIMD mult $b_m[i] \leftarrow a_m[j] * z_m[k]$; $m = 1 \ldots 8$. IP core: 10, 20, 30, 40 |
| WAL_BCE_JK_VPROD      | 14 | 8xSIMD vector products. IP core: 30, 40<br>$z_m[i] \leftarrow a_m'[jj+nn] * b_m[kk+nn]$; $m = 1 \ldots 8$; nn range 1..255 |
| WAL_BCE_JK_VMAC       | 15 | 8xSIMD vector MACs. IP core: 20, 30, 40<br>$z_m[ii+nn] \leftarrow z_m[ii+nn] + a_m[jj+nn] * b_m[kk+nn]$; nn range 1..13 |
| WAL_BCE_JK_VMSUBAC    | 16 | 8xSIMD vector MSUBACs. IP core: 20, 30, 40<br>$z_m[ii+nn] \leftarrow z_m[ii+nn] - a_m[jj+nn] * b_m[kk+nn]$; nn range 1..13 |
| WAL_BCE_JK_VPROD_S8   | 17 | 8xSIMD vector product (extended). IP core: 40<br>$z_m[i] \leftarrow ((a_1'[jj+nn]*b_1[kk+nn] + a_2'[jj+nn]*b_2[kk+nn]) + (a_3'[jj+nn]*b_3[kk+nn] + a_4'[jj+nn]*b_4[kk+nn])) + ((a_5'[jj+nn]*b_5[kk+nn] + a_6'[jj+nn]*b_6[kk+nn]) + (a_7'[jj+nn]*b_7[kk+nn] + a_8'[jj+nn]*b_8[kk+nn]))$;<br>$m = 1 \ldots 8$; nn range 1..255 |
| WAL_BCE_JK_VDIV       | 20 | Vector division (extended). IP core: 10, 20, 30, 40<br>$z_m[i] \leftarrow a_1[j] / b_1[k]$; $m = 1 \ldots 8$ |
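The operation codes of Table 2 and the meaning of the two dot-product variants can be summarised in a small software reference model. This is a sketch under stated assumptions: the enum copies the names and decimal values from Table 2 (the real MicroBlaze header may define them differently, for example as macros), while `ref_vprod`, `ref_vprod_s8`, the lane-major array layout and the accumulation order inside each lane are hypothetical choices made for illustration.

```c
#include <stddef.h>

/* Operation codes as listed in Table 2 (values in decimal). */
enum wal_bce_jk_op {
    WAL_BCE_JK_VVER = 0,
    WAL_BCE_JK_VZ2A = 1,   WAL_BCE_JK_VB2A = 2,
    WAL_BCE_JK_VZ2B = 3,   WAL_BCE_JK_VA2B = 4,
    WAL_BCE_JK_VADD = 5,   WAL_BCE_JK_VADD_BZ2A = 6,   WAL_BCE_JK_VADD_AZ2B = 7,
    WAL_BCE_JK_VSUB = 8,   WAL_BCE_JK_VSUB_BZ2A = 9,   WAL_BCE_JK_VSUB_AZ2B = 10,
    WAL_BCE_JK_VMULT = 11, WAL_BCE_JK_VMULT_BZ2A = 12, WAL_BCE_JK_VMULT_AZ2B = 13,
    WAL_BCE_JK_VPROD = 14, WAL_BCE_JK_VMAC = 15,       WAL_BCE_JK_VMSUBAC = 16,
    WAL_BCE_JK_VPROD_S8 = 17,
    WAL_BCE_JK_VDIV = 20
};

enum { SIMD_LANES = 8 };

/* VPROD-like model: one dot product of length n per lane.
 * a and b are lane-major: element t of lane m is a[m * n + t]. */
void ref_vprod(float z[SIMD_LANES], const float *a, const float *b, size_t n)
{
    for (size_t m = 0; m < SIMD_LANES; ++m) {
        float acc = 0.0f;
        for (size_t t = 0; t < n; ++t)
            acc += a[m * n + t] * b[m * n + t];
        z[m] = acc;
    }
}

/* VPROD_S8-like model: the 8 lane results are summed pairwise, in the
 * grouping shown in Table 2, and the scalar is copied to all lanes.   */
void ref_vprod_s8(float z[SIMD_LANES], const float *a, const float *b, size_t n)
{
    float p[SIMD_LANES];
    ref_vprod(p, a, b, n);
    float s = ((p[0] + p[1]) + (p[2] + p[3])) + ((p[4] + p[5]) + (p[6] + p[7]));
    for (size_t m = 0; m < SIMD_LANES; ++m)
        z[m] = s;
}
```

Such a model is mainly useful on the ARM or MicroBlaze side as a golden reference when checking accelerator results; small differences in the last bits are possible if the hardware accumulates in a different order.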



#### Ports of the 8xSIMD EdkDSP accelerator

| Port             | Description |
|------------------|-------------|
| bce_atoa[0:9]    | Memory A address (addressing 1024 32 bit floating point values) |
| bce_atob[0:9]    | Memory B address (addressing 1024 32 bit floating point values) |
| bce_atoz[0:9]    | Memory Z address (addressing 1024 32 bit floating point values) |
| bce_done[0:7]    | Vector operation in progress or finished |
| bce_led4b[0:3]   | 4 bit output, intended for LED signalling. (Unconnected in the evaluation design). |
| bce_mode[0:3]    | Mode of the communication protocol PicoBlaze6 - MicroBlaze |
| bce_op[0:7]      | Vector operation to be performed. |
| bce_port[0:7]    | 8 bit output port. (Unconnected in the evaluation design). |
| bce_port_id[0:7] | 8 bit output. External port address. Address space [0x0 .. 0x1F] is reserved for optimized construction of the VLIW instruction to the 8xSIMD vector processing unit of the EdkDSP. Address space [0x20 .. 0xFF] can be used by the user. |
| bce_port_wr      | 1 bit output. Write strobe for write of 8 bit data to the external port address. |
| bce_r_pb         | 1 bit output. Reset of the PicoBlaze6. |
| bce_we           | 1 bit output. Write strobe that signals the start of execution of a VLIW instruction by the 8xSIMD vector processing unit of the EdkDSP. |
| bce_dip4b[0:3]   | 4 bit input (Connected to a constant in the evaluation design). |
| bce_gpi8b[0:7]   | 8 bit input (Connected to a constant in the evaluation design). |
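The numbers in the port list can be cross-checked with a few lines of C. The sketch below only restates facts from this section (10 bit memory addresses, 8 layers of 1024 32 bit words per memory, port IDs 0x00 to 0x1F reserved); all identifiers are hypothetical helper names, not part of the EdkDSP API.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    EDKDSP_SIMD_LANES     = 8,    /* parallel layers per BRAM (A, B or Z) */
    EDKDSP_WORDS_PER_LANE = 1024, /* 32 bit words per layer               */
    EDKDSP_ADDR_BITS      = 10    /* width of bce_atoa / bce_atob / bce_atoz */
};

/* Words reachable through one 10 bit address port. */
uint32_t edkdsp_addressable_words(void)
{
    return 1u << EDKDSP_ADDR_BITS;
}

/* Size of one memory in bytes: 8 layers x 1024 words x 4 bytes. */
uint32_t edkdsp_bytes_per_memory(void)
{
    return (uint32_t)EDKDSP_SIMD_LANES * (uint32_t)EDKDSP_WORDS_PER_LANE * 4u;
}

/* Port IDs 0x00..0x1F are reserved for building the VLIW instruction. */
bool edkdsp_port_id_is_reserved(uint8_t port_id)
{
    return port_id <= 0x1Fu;
}

/* Port IDs 0x20..0xFF are free for user logic. */
bool edkdsp_port_id_is_user(uint8_t port_id)
{
    return port_id >= 0x20u;
}
```

The 10 bit address width matches the 1024 words per layer exactly, and each of the memories A, B and Z holds 32 KiB in total across its 8 layers.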



Figure 3: 8xSIMD EdkDSP floating point accelerator IP core with System ILA.

#### Interface of the 8xSIMD EdkDSP IP to the MicroBlaze processor

The EdkDSP IP core is connected to the 100 MHz MicroBlaze processor via the 100 MHz 32bit AXI lite bus represented by port **s\_axi**, 100 MHz clock input **axi\_aclk** and an asynchronous reset signal **axi\_aresetn**. See *Figure 3*.



The debug ports are used for real-time visualisation, debug and analysis of the computation implemented inside the 8xSIMD data flow unit (DFU) of the (8xSIMD) EdkDSP accelerator IP. This makes it easier to debug the compiled PicoBlaze6 firmware code. The implemented in-circuit logic analyser (System ILA) debug probes can capture 8192 data samples on the TE0720-03-2IF and TE0720-03-1QF modules and 2048 data samples on the TE0720-03-14S-1C module. System ILA provides visibility of the auto-generated addresses and of the detailed schedule of vector operations in the 8xSIMD EdkDSP IP core. See *Figure 3*.

Figure 4 presents connection of the two parts of the 8xSIMD EdkDSP IP core.



Figure 4: Internal details of (8xSIMD) EdkDSP floating point accelerator IP core.

All bce\_fp12\_1x8\_0\_axiw\_v1\_[10|20|30|40] accelerator versions have an identical Edk IP part.

The DSP part has identical ports and connectivity for all bce\_fp12\_1x8\_0\_axiw\_v1\_[10|20|30|40] accelerator versions.

The Edk part of the EdkDSP floating point accelerator IP core, bce\_fp12\_1x8\_0\_axiw\_v1\_0\_c, contains the PicoBlaze6 controller, its program memories P0 and P1 and the 8xSIMD dual-ported block RAM memories 8xA, 8xB and 8xZ designed for parallel access. The bce\_fp12\_1x8\_0\_axiw\_v1\_0\_c IP is designed in the Xilinx System Generator 14.5 and ported to the Vivado 2017.4.1 compatible IP core. The PicoBlaze6 firmware executes C code and supports C constructs like loops, while, if, else, function calls etc.

The first of the two ports of each block RAM is accessed by the MicroBlaze as memory via the AXI-lite bus.

- The second port of both program memories P0 and P1 is connected to the PicoBlaze6 controller.
- The second port of all data memories 8xA, 8xB and 8xZ is connected to the floating point data paths of the data flow unit (DFU) and supports parallel access.



The DFU **bce\_fp12\_1x8\_0\_dsp** is designed in the Xilinx System Generator for DSP 2017.4.1. It contains 8 pipelined floating point ADD units, 8 pipelined floating point MULT units and one pipelined floating point DIV unit. The DFU supports all vector operations defined in Table 2.

- The 100 bit VLIW instruction is transferred through two 50 bit ports, mem\_bce\_i\_lo and mem\_bce\_i\_hi. The VLIW instruction is set by dedicated PicoBlaze6 output ports. See Table 3.
- The 8xSIMD data flow unit executes the 8xSIMD floating point operations defined in Table 2.
- The concrete 8xSIMD operation is defined by the PicoBlaze6 DFU\_OP 8 bit output register driving the mem\_bce\_op port of the bce\_fp12\_1x8\_0\_axiw\_v1\_0\_c IP. The transfer of the complete VLIW instruction (100+8 bits) is triggered by the write strobe signal mem\_bce\_we. It is activated when the PicoBlaze6 program writes the 8xSIMD operation DFU\_OP. See Table 3.

The 8xSIMD data flow unit (DFU) indicates the end of the operation on the 8 bit output port mem\_bce\_done. The PicoBlaze6 program can execute a few instructions in parallel with the 8xSIMD operation defined in DFU\_OP. The PicoBlaze6 program detects the end of the 8xSIMD operation by reading the 8 bit input port mem\_bce\_done. The PicoBlaze6 firmware defines the sequence of VLIW instructions for the 8xSIMD DFU through its dedicated output registers. The PicoBlaze6 addresses of these dedicated output registers are listed in Table 3.

Table 3: PicoBlaze6 ports forming VLIW instruction for the 8xSIMD EdkDSP data flow unit.

|                                             | 1           | T             |                                        |
|---------------------------------------------|-------------|---------------|----------------------------------------|
| PicoBlaze6 registers used for definition of | Format      | VLIW          | Description of sections defined in the |
| the 100 bit wide VLIW instruction for the   | [msb..lsb]  | [2x 50bit]      | VLIW instruction for the EdkDSP Data   |
| EdkDSP Data Flow Unit                       |             | mem_bce_i_hi    | Flow Unit                              |
|                                             |             | mem_bce_i_lo    |                                        |
| [00b, DFU_CNT]                              | [2bit,8bit] | 10 bit [49..40] | Number of 8xSIMD steps (0..255)        |
| [00b, DFU_Z_INC]                            | [2bit,8bit] | 10 bit [39..30] | Auto increment of Z address (0..255)   |
| [DFU_Z_MEM_BANK, DFU_Z_MEM_SADDR]           | [2bit,8bit] | 10 bit [29..20] | Set Z address after auto incr overflow |
| [DFU_Z_MEM_BANK, DFU_Z_MEM_ADDR]            | [2bit,8bit] | 10 bit [19..10] | Initial Z address                      |
| [00b, DFU_B_INC]                            | [2bit,8bit] | 10 bit [09..00] | Auto increment of B address (0..255)   |
| [DFU_B_MEM_BANK, DFU_B_MEM_SADDR]           | [2bit,8bit] | 10 bit [49..40] | Set B address after auto incr overflow |
| [DFU_B_MEM_BANK, DFU_B_MEM_ADDR]            | [2bit,8bit] | 10 bit [39..30] | Initial B address                      |
| [00b, DFU_A_INC]                            | [2bit,8bit] | 10 bit [29..20] | Auto increment of A address (0..255)   |
| [DFU_A_MEM_BANK, DFU_A_MEM_SADDR]           | [2bit,8bit] | 10 bit [19..10] | Set A address after auto incr overflow |
| [DFU_A_MEM_BANK, DFU_A_MEM_ADDR]            | [2bit,8bit] | 10 bit [09..00] | Initial A address                      |
|                                             |             |                 |                                        |
| [0000b, PBP_REG01]                          | [4bit,4bit] | 8 bit           | Set actual VLIW instr. memory (0..15)  |
| [DFU_OP]                                    | [8bit]      | 8 bit           | Execute SIMD operation with            |
|                                             |             |                 | parameters in the actual VLIW instr.   |
|                                             |             |                 | memory (set by the PBP_REG01 port).    |
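To make the bit layout of the table above more concrete, the following C sketch builds one 10-bit field from a 2-bit bank selector and an 8-bit value, and packs five such fields into one 50-bit half of the VLIW instruction. It only illustrates the layout. The helper names `dfu_field10` and `dfu_pack_half` are hypothetical, and the field order within a half (first listed field in the most significant bits) is an assumption taken from the bit ranges in the table. On the real hardware the sections are written through the PicoBlaze6 8-bit output ports, not packed in software.

```c
#include <stdint.h>

/* Build one 10-bit field from a 2-bit bank selector and an 8-bit value.
 * For the counter and increment fields the bank bits are 00b (bank = 0). */
uint16_t dfu_field10(unsigned bank, unsigned value8)
{
    return (uint16_t)(((bank & 0x3u) << 8) | (value8 & 0xFFu));
}

/* Pack five 10-bit fields into one 50-bit half of the VLIW instruction.
 * f[0] lands in bits [49..40], f[4] in bits [09..00] (assumed order). */
uint64_t dfu_pack_half(const uint16_t f[5])
{
    uint64_t half = 0;
    for (int i = 0; i < 5; ++i) {
        half = (half << 10) | (uint64_t)(f[i] & 0x3FFu);
    }
    return half;
}
```

For example, a step count of 255 in the first field occupies bits [49..40] of the packed half, and a bank 1 / address 1 pair in the last field occupies bits [09..00].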



# 3. EdkDSP IP Core - PicoBlaze6 C Application Interface Functions

The EdkDSP compiler supports embedded compilation of simple C and ASM programs for the PicoBlaze6 controller. PicoBlaze6 programs can use the predefined and precompiled library functions listed in *Table 4*. The functions are optimized in PicoBlaze6 assembler code, occupy a fixed area of the firmware and serve as a common, simple API for C and ASM PicoBlaze6 programs.

The PicoBlaze6 firmware image with the precompiled support functions is present in the MicroBlaze header file **fill\_def\_program\_store.h**. The PicoBlaze6 application program firmware is merged with this precompiled image by the MicroBlaze SW program.

Table 4: PicoBlaze6 precompiled support functions

| PicoBlaze6 predefined functions                                         | Description                                                         |
|-------------------------------------------------------------------------|---------------------------------------------------------------------|
| unsigned char mb2pb_read_data();                                        | Single unsigned char from MicroBlaze to PicoBlaze6                  |
| void pb2mb_write(unsigned char data);                                   | Single unsigned char from PicoBlaze6 to MicroBlaze                  |
| void pb2mb_eoc(unsigned char data);                                     | EOC unsigned char from PicoBlaze6 to MicroBlaze                     |
| void pb2mb_req_reset(unsigned char data);                               | Request from PicoBlaze6 to MicroBlaze to initiate PB reset          |
| void pb2mb_reset();                                                     | Information from PicoBlaze6 to MicroBlaze - PB reset                |
| void pb2dfu_set(unsigned char mem,                                      | Set one section of the VLIW instruction for the data flow unit      |
| unsigned char data);                                                    | (DFU) to an unsigned char data. VLIW instruction sections are       |
|                                                                         | addressed as PicoBlaze6 8bit output ports defined in <i>Table 3</i> |
| void pb2dfu_wait4hw();                                                  | PicoBlaze6 function is waiting for the termination of data flow     |
|                                                                         | unit operation.                                                     |
| unsigned char led2pb();                                                 | Write from PicoBlaze6 to 4 bit led output port                      |
| unsigned char btn2pb();                                                 | Read from 4 bit input port to PicoBlaze6                            |
| unsigned char hex_h(unsigned char ch);                                  | Translate upper 4 bit nibble of an unsigned char to ASCII           |
| unsigned char hex_l(unsigned char ch);                                  | Translate lower 4 bit nibble of an unsigned char to ASCII           |
| void pb2lcd_ascii_char(unsigned char ch, unsigned char pos);            | Write from PicoBlaze6 to LCD ASCII alphanumerical display           |
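As an illustration of what `hex_h()` and `hex_l()` in *Table 4* compute, the portable C sketch below converts the upper and the lower nibble of a byte to an ASCII hexadecimal digit. The functions `ref_hex_h` and `ref_hex_l` are hypothetical host-side stand-ins, not the precompiled PicoBlaze6 routines, and the use of upper-case digits A to F is an assumption.

```c
/* Map a value 0..15 to its ASCII hexadecimal digit (upper case assumed). */
unsigned char nibble_to_ascii(unsigned char n)
{
    n &= 0x0Fu;
    return (unsigned char)(n < 10 ? '0' + n : 'A' + (n - 10));
}

/* Host-side stand-in for hex_h(): ASCII digit of the upper nibble. */
unsigned char ref_hex_h(unsigned char ch)
{
    return nibble_to_ascii((unsigned char)(ch >> 4));
}

/* Host-side stand-in for hex_l(): ASCII digit of the lower nibble. */
unsigned char ref_hex_l(unsigned char ch)
{
    return nibble_to_ascii(ch);
}
```

A byte could then be shown as two characters, for instance by passing the two results to `pb2lcd_ascii_char()` at consecutive display positions.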



# 4. EdkDSP IP Core – MicroBlaze C Application Interface Functions

The MicroBlaze program is responsible for data communication, for programming and initialization of the PicoBlaze6, and for global scheduling of the implemented algorithm. The API providing the MicroBlaze - PicoBlaze6 interface is called the Worker Abstraction Layer (WAL).

- 8xSIMD EdkDSP memory pointers and program memory pointers (from MicroBlaze view) are defined in *Table 5*.
- WAL error codes are defined in *Table 6*.
- The WAL API functions supporting the 8xSIMD EdkDSP are listed and described in *Table 7*.

Table 5: MicroBlaze access names to 8xSIMD EdkDSP memory banks

| MicroBlaze access names | Description of the 8xSIMD EdkDSP memory banks                      |
|-------------------------|--------------------------------------------------------------------|
| WAL_BCE_JK_DMEM_A       | index of the A data memory banks (8x [0..1023] 32bit words)        |
| WAL_BCE_JK_DMEM_B       | index of the B data memory banks (8x [0..1023] 32bit words)        |
| WAL_BCE_JK_DMEM_Z       | index of the Z data memory banks (8x [0..1023] 32bit words)        |
|                         |                                                                    |
| WAL_CMEM_MB2PB          | index to MB2PB control memory (the control register of the worker) |
| WAL_CMEM_PB2MB          | index to PB2MB control memory (the status register of the worker)  |
| WAL_PBID_P0             | index to P0 control memory (PicoBlaze program memory 1)            |
| WAL_PBID_P1             | index to P1 control memory (PicoBlaze program memory 2)            |

Table 6: MicroBlaze WAL error codes

| MicroBlaze WAL codes | Value | Description                 |
|----------------------|-------|-----------------------------|
| WAL_RES_OK           | 0     | all is OK                   |
| WAL_RES_WNULL        | 1     | argument is a NULL          |
| WAL_RES_ERR          | -1    | generic error               |
| WAL_RES_ENOINIT      | -2    | not initiated               |
| WAL_RES_ENULL        | -3    | null pointer                |
| WAL_RES_ERUNNING     | -4    | worker is running           |
| WAL_RES_ERANGE       | -5    | index/value is out of range |
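The codes in *Table 6* can be handled in application code roughly as sketched below. The enum repeats the values from the table so that the sketch is standalone; in a real MicroBlaze build the definitions come from **wal.h** and the enum must be dropped. The helper names `wal_res_str` and `wal_res_is_error` are hypothetical, and treating the positive code WAL_RES_WNULL as a warning rather than an error is an assumption based only on its sign.

```c
/* Values copied from Table 6; a real build takes them from wal.h. */
enum {
    WAL_RES_OK       =  0,
    WAL_RES_WNULL    =  1,
    WAL_RES_ERR      = -1,
    WAL_RES_ENOINIT  = -2,
    WAL_RES_ENULL    = -3,
    WAL_RES_ERUNNING = -4,
    WAL_RES_ERANGE   = -5
};

/* Return the Table 6 description for a WAL result code. */
const char *wal_res_str(int rc)
{
    switch (rc) {
    case WAL_RES_OK:       return "all is OK";
    case WAL_RES_WNULL:    return "argument is a NULL";
    case WAL_RES_ERR:      return "generic error";
    case WAL_RES_ENOINIT:  return "not initiated";
    case WAL_RES_ENULL:    return "null pointer";
    case WAL_RES_ERUNNING: return "worker is running";
    case WAL_RES_ERANGE:   return "index/value is out of range";
    default:               return "unknown code";
    }
}

/* Negative codes are errors; WAL_RES_WNULL (1) is treated as a warning here. */
int wal_res_is_error(int rc)
{
    return rc < 0;
}
```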

Table 7: MicroBlaze API functions for communication with 8xSIMD EdkDSP IP core

#### MicroBlaze API functions for communication with 8xSIMD EdkDSP IP core

#### wal\_init\_worker() - generalised function for worker initialisation

\*wrk is a pointer to the worker structure.

This function is intended to be called from the user application. It checks whether the \*wrk structure is prepared for worker initialisation (the family description structure must be set) and then calls the assigned family function (init\_wrk()). The called function should initialise all arrays of pointers to shared memories.

Return Value: The function returns WAL\_RES\_OK if successful and WAL\_RES\_E... if any error occurs.

int wal\_init\_worker(struct wal\_worker \*wrk);
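The "generalised function calls the function assigned to the family description structure" pattern used by the WAL API can be sketched with function pointers as shown below. This is a minimal illustration, not the WAL source code: the types `sk_worker` and `sk_family`, their fields and the function `sk_init_worker` are hypothetical, and the choice of WAL\_RES\_ENULL and WAL\_RES\_ENOINIT for the two failure cases is an assumption. Only the numeric values of the result codes are taken from *Table 6*.

```c
#include <stddef.h>

/* Result code values from Table 6 (a real build takes them from wal.h). */
enum { WAL_RES_OK = 0, WAL_RES_ENOINIT = -2, WAL_RES_ENULL = -3 };

struct sk_worker;   /* hypothetical stand-in for struct wal_worker */

/* Hypothetical family description: one function pointer per operation. */
struct sk_family {
    int (*init_wrk)(struct sk_worker *wrk);
};

struct sk_worker {
    const struct sk_family *family;  /* must be set before initialisation */
    int initialised;
};

/* Generic wrapper: validate the worker, then dispatch to the family function. */
int sk_init_worker(struct sk_worker *wrk)
{
    if (wrk == NULL)
        return WAL_RES_ENULL;        /* assumed code for a NULL worker */
    if (wrk->family == NULL || wrk->family->init_wrk == NULL)
        return WAL_RES_ENOINIT;      /* assumed code for a missing family */
    return wrk->family->init_wrk(wrk);
}

/* Example family implementation that only marks the worker as ready. */
int demo_init_wrk(struct sk_worker *wrk)
{
    wrk->initialised = 1;
    return WAL_RES_OK;
}
```

The other generalised functions of the API (done, reset, start and end of operation, memory transfers) follow the same shape according to their descriptions: a check in the wrapper followed by a call through the family description structure.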

#### wal\_done\_worker() - generalised function for worker clean-up

\*wrk is a pointer to the worker structure.

This function is intended to be called from the user application. It calls the done function (done\_wrk()) assigned to the family description structure. The called function should clean up and release all dynamically allocated worker structures, memories and resources that were created in the worker init function.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_done\_worker(struct wal\_worker \*wrk);

#### wal\_reset\_worker() - generalised function for worker hard reset

\*wrk is a pointer to the worker structure.

This function is intended to be called from the user application. It calls the reset function (reset\_wrk()) assigned to the family description structure. The called function should reset the worker control registers (by the HARD RESET bit in the worker control register). The reset is not acknowledged by the accelerator.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_reset\_worker(struct wal\_worker \*wrk);

#### wal\_start\_operation() - generalised function for starting an operation on the accelerator

\*wrk is a pointer to the worker structure. pbid is an index of the PB firmware to use (WAL\_PBID\_...).

This function is intended to be called from the user application. It checks whether the accelerator is in the idle state and then calls the function for starting an operation (start\_op()) assigned to the family description structure. The called function should start a new accelerator operation by setting the accelerator control register and checking the status register. This function is blocking, i.e. it waits for an acknowledgement from the accelerator.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_start\_operation(struct wal\_worker \*wrk, unsigned int pbid);

#### wal\_end\_operation() - generalised function for finishing an operation on the accelerator

\*wrk is a pointer to the worker structure.

This function is intended to be called from the user application. It checks whether the accelerator is in the processing state and then calls the function for ending an operation (end\_op()) assigned to the family description structure. The called function should stop the processing operation on the accelerator. It then waits for synchronization with the accelerator, so the function is blocking.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_end\_operation(struct wal\_worker \*wrk);

#### wal\_mb2pb() - generalised function for setting the worker control register

\*wrk is a pointer to the worker structure. data is user data to be sent to the worker control register.

This function is intended to be called from the user application. It calls the function for setting the worker control register (mb2pb()) assigned to the family description structure. The called function should send the user data through the control register while controlling the READ bit. It should also wait for synchronization with the accelerator.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_mb2pb(struct wal\_worker \*wrk, const uint32\_t data);

#### wal\_pb2mb() - generalised function for reading the worker status register

\*wrk is a pointer to the worker structure. \*data is a pointer to an output buffer where the read user data is written.

This function is intended to be called from the user application. It calls the function for reading the worker status register (pb2mb()) assigned to the family description structure. The called function should read the user data through the worker status register and wait for synchronization with the accelerator.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_pb2mb(struct wal\_worker \*wrk, uint32\_t \*data);

#### wal\_mb2cmem() - generalised function for writing a block of data to any worker control or support memory

\*wrk is a pointer to the worker structure. memid is an index of the control/support memory where data are written to (WAL\_CMEM\_... or WAL\_...\_SMEM\_...). memoffs is the offset in the selected memory (in words, not in bytes). \*outbuf is a pointer to memory where data are read from. len is the number of words to copy from \*outbuf to the accelerator control memory.

This function is intended to be called from the user application. It checks the index of the required memory and then calls the function for writing data to any control/support memory (mb2cmem()) assigned to the family description structure. The called function should get a pointer to the right memory according to the required index **memid**. Support memories must be assigned indices greater than the indices of the control memories. The called function should then copy a block of data from CPU memory \*outbuf to the accelerator control/support memory selected by **memid**, starting at offset **memoffs** in the selected memory.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_mb2cmem(struct wal\_worker \*wrk, unsigned int memid, unsigned int memoffs, const uint32\_t \*outbuf, unsigned int len);

#### wal\_cmem2mb() - generalised function for reading a block of data from any worker control or support memory

\*wrk is a pointer to the worker structure. memid is an index of the control/support memory where data are read from (WAL\_CMEM\_... or WAL\_...\_SMEM\_...). **memoffs** is the offset in the selected memory (in words, not in bytes). \*inbuf is a pointer to memory where data are written to. len is the number of words to copy from the accelerator control memory.

This function is intended to be called from the user application. It checks the index of the required memory and then calls the function for reading data from any control/support memory (cmem2mb()) assigned to the family description structure. The called function should get a pointer to the right memory according to the required index **memid**. Support memories must be assigned indices greater than the indices of the control memories. The called function should then copy a block of data from the accelerator control/support memory selected by **memid**, starting at offset **memoffs** in the selected memory, to CPU memory \*inbuf.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_cmem2mb(struct wal\_worker \*wrk, unsigned int memid, unsigned int memoffs, uint32\_t \*inbuf, unsigned int len);

#### wal\_mb2dmem() - generalised function for writing a block of data to any worker data memory

\*wrk is a pointer to the worker structure. simdid is an index of the SIMD unit whose data memories are accessed. memid is an index of the data memory where data are written to (WAL\_BCE\_JK\_DMEM\_..., see *Table 5*). memoffs is the offset in the selected memory (in words, not in bytes). \*outbuf is a pointer to memory where data are read from. len is the number of words to copy from \*outbuf to the accelerator data memory.

This function is intended to be called from the user application. It checks the index of the required memory and then calls the function for writing data to any data memory (mb2dmem()) assigned to the family description structure. The called function should get a pointer to the right memory according to the required SIMD index **simdid** and memory index **memid**. The called function should then copy a block of data from CPU memory \*outbuf to the accelerator data memory, starting at offset **memoffs** inside the selected memory.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_mb2dmem(struct wal\_worker \*wrk, unsigned int simdid, unsigned int memid, unsigned int memoffs, const void \*outbuf, unsigned int len);

#### wal\_dmem2mb() - generalised function for reading a block of data from any worker data memory

\*wrk is a pointer to the worker structure. simdid is an index of the SIMD unit whose data memories are accessed. memid is an index of the data memory where data are read from (WAL\_BCE\_JK\_DMEM\_..., see *Table 5*). memoffs is the offset in the selected memory (in words, not in bytes). \*inbuf is a pointer to memory where data are written to. len is the number of words to copy from the accelerator data memory.

This function is intended to be called from the user application. It checks the index of the required memory and then calls the function for reading data from any data memory (dmem2mb()) assigned to the family description structure. The called function should get a pointer to the right memory according to the required SIMD index **simdid** and memory index **memid**. The called function should then copy a block of data from the accelerator data memory, starting at offset **memoffs** inside the selected memory, to CPU memory \*inbuf.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_dmem2mb(struct wal\_worker \*wrk, unsigned int simdid, unsigned int memid, unsigned int memoffs, void \*inbuf, unsigned int len);

#### wal\_set\_firmware() - generalised function for writing PicoBlaze firmware

\*wrk is a pointer to the worker structure. **pbid** is an index of the PB firmware to use (WAL\_PBID\_...). \*fwbuf is a pointer to the firmware in CPU memory. fwsize is the size of the firmware in words; a negative value selects the full firmware (4096 words).

This function is intended to be called from the user application. It checks whether all arguments are correct and then calls the function for writing the PB firmware (set\_fw()). The called function should copy the firmware from CPU memory \*fwbuf to the PicoBlaze6 program memory in the accelerator. The PB program memory is selected by the argument **pbid**. The firmware need not be the full 4096 words long. The firmware length (in words) is set by the argument **fwsize**. If **fwsize** is negative (the defined value WAL\_FW\_WHOLE can be used), the function assumes that the firmware length is 4096 words.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_set\_firmware(struct wal\_worker \*wrk, int pbid, const unsigned int \*fwbuf, int fwsize);
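The handling of the **fwsize** argument described above can be sketched as follows. This is a minimal illustration of the stated rule (a negative size selects the full 4096-word firmware), not the WAL implementation. The names `sk_fw_words`, `SK_FW_FULL_WORDS` and `SK_FW_WHOLE` are hypothetical, the value -1 for the stand-in of WAL\_FW\_WHOLE is an assumption (the text only says the value is negative), and the rejection of sizes above 4096 words is an assumed range check.

```c
#define SK_FW_FULL_WORDS 4096   /* full PicoBlaze6 program memory, from the text */
#define SK_FW_WHOLE      (-1)   /* hypothetical stand-in for WAL_FW_WHOLE */

/* Return the number of firmware words to copy,
 * or -1 if fwsize exceeds the program memory (assumed check). */
int sk_fw_words(int fwsize)
{
    if (fwsize < 0)
        return SK_FW_FULL_WORDS;   /* negative size selects the whole firmware */
    if (fwsize > SK_FW_FULL_WORDS)
        return -1;
    return fwsize;
}
```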

#### wal\_bce\_jk\_get\_id() - implementation of the worker get\_id() function for the BCE\_JK families

\*wrk is a pointer to the worker structure. **pbid** is an index of the PB firmware to use (WAL\_PBID\_...). \*outval is a pointer to an output buffer for the read worker ID.

The function emulates reading the worker ID from hardware because the BCE\_JK families don't support this operation in the hardware.

Return Value: The function always returns WAL\_RES\_OK.

int wal\_get\_id(struct wal\_worker \*wrk, int pbid, unsigned int \*outval);

#### wal\_bce\_jk\_get\_cap() - implementation of the worker get\_cap() function for the BCE\_JK families

\*wrk is a pointer to the worker structure. **pbid** is an index of the PB firmware to use (WAL\_PBID\_...). \*outval is a pointer to an output buffer for the read capabilities.

The function sends the operation WAL\_BCE\_JK\_VVER to the accelerator, reads the worker capabilities and returns the read value in the \*outval buffer.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_get\_capabilities(struct wal\_worker \*wrk, int pbid, unsigned int \*outval);

#### wal\_bce\_jk\_get\_lic() - implementation of the get\_lic() function for the BCE\_JK families

\*wrk is a pointer to the worker structure. **pbid** is an index of used PB firmware (WAL\_PBID\_...). \*outval is a pointer to an output buffer for read license.

The function reads the license from the worker. For BCE\_JK families the license is a 2bit license down-counter contained in the value returned by accelerator operation WAL\_BCE\_JK\_VVER. The 2bit license counter is returned in the \*outval buffer.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_get\_license(struct wal\_worker \*wrk, int pbid, unsigned int \*outval);

All worker abstraction layer API functions listed in *Table 7* are precompiled into the MicroBlaze library **wal.a** and declared in the MicroBlaze header files **wal.h** and **wal\_bce\_jk.h**.

The worker abstraction layer API functions listed in *Table 7* support the use of several instances of the 8xSIMD EdkDSP IP core in one design.





# 5. EdkDSP IP Core – Integration with dual core ARM A9 Linux

The 8xSIMD EdkDSP IP core is integrated in a tester system with architecture presented in *Figure 2* and photo of the HW presented in *Figure 1* and *Figure 5*.

The dual core ARM Cortex A9 system runs a configured PetaLinux 2017.4.1 operating system and supports:

- Ethernet 1 Gbit
- SSH, telnet, FTP, ...
- The system image is located on SD card. After the initial boot, the file system is decompressed to the RAM FS in DDR3. The SD card file system is mounted and visible in the running PetaLinux.
- Symmetrical multiprocessing on two ARM A9 processors
- SDSoC 2017.4.1 generated HW accelerators with data movers based on:
  - Simple DMA with HW supported data movers (DMA data width 32bit or 64bit) with no ARM interrupts. Simple DMA requires allocation of contiguous memory space.
  - SG DMA with data movers (DMA data width 32bit or 64bit) with ARM interrupts. SG DMA can
    work with contiguous allocation of memory or with standard Linux allocation of memory,
    where contiguous allocation is not guaranteed.
  - HW data movers connected to the advanced cache coherent port, which resolves in HW the cache coherency between dual core ARM access and data mover access to DDR3.

The MicroBlaze processor and the 8xSIMD EdkDSP IP core require initialisation and synchronisation with Linux and the dual core ARM subsystem. This is arranged by the following organisation of the DDR3 memory (1 GB), see *Table 8*:

| Memory Area (in Bytes)      | Size       | Description                                                                                                                                                          |
|-----------------------------|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0x0000 0000 ... 0x27FF FFFF | 640 M Byte | Memory managed by the standard Linux memory allocation mechanism. Used by the dual core ARM A9 symmetrical multiprocessing 32 bit Linux                              |
| 0x2800 0000 ... 0x280F FFFF | 1 M Byte   | Reserved for MicroBlaze – ARM communication. It is contiguous memory reserved in the Linux configuration                                                             |
| 0x2800 0000 ... 0x2800 0FFF | 4 kByte    | Reserved for PicoBlaze6 f0 firmware (MicroBlaze and ARM)                                                                                                             |
| 0x2800 1000 ... 0x2800 1FFF | 4 kByte    | Reserved for PicoBlaze6 f1 firmware (MicroBlaze and ARM)                                                                                                             |
| 0x2800 2000 ... 0x2800 2FFF | 4 kByte    | Reserved for PicoBlaze6 f2 firmware (MicroBlaze and ARM)                                                                                                             |
| 0x2800 3000 ... 0x2800 3FFF | 4 kByte    | Reserved for PicoBlaze6 f3 firmware (MicroBlaze and ARM)                                                                                                             |
| 0x2800 4000 ... 0x280F FFFF | Reserved   | Reserved for 8xSIMD EdkDSP data (MicroBlaze and ARM)                                                                                                                 |
| 0x2810 0000 ... 0x29FF FFFF | 15 M Byte  | MicroBlaze program & data. The MicroBlaze processor IP is configured for execution of its code from 0x28100000. It is a part of the contiguous memory reserved in Linux. |
| 0x2A00 0000 ... 0x2FFF FFFF | 112 M Byte | Contiguous memory reserved for video frame buffers.                                                                                                                  |
| 0x3000 0000 ... 0x3FFF FFFF | 256 M Byte | Memory reserved for SDSoC data mover and DMA drivers.                                                                                                                |

*Table 8: Organisation of DDR3 memory* 
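The start addresses of the shared areas in *Table 8* could be captured in a small C header fragment such as the one below, usable on both the ARM and the MicroBlaze side. It is a sketch under stated assumptions: all macro names and the function `sk_fw_slot_addr` are hypothetical, and the firmware slots f0 to f3 are assumed to follow each other in 4 kByte steps from 0x2800 0000, as the start addresses in the table indicate.

```c
#include <stdint.h>

#define SK_SHARED_BASE   0x28000000u  /* start of the MicroBlaze - ARM communication area */
#define SK_FW_SLOT_SIZE  0x1000u      /* 4 kByte per PicoBlaze6 firmware */
#define SK_FW_SLOTS      4u           /* firmware slots f0 .. f3 */
#define SK_DATA_BASE     0x28004000u  /* start of the 8xSIMD EdkDSP data area */
#define SK_MB_CODE_BASE  0x28100000u  /* MicroBlaze program & data */

/* Physical start address of firmware slot f0..f3, or 0 for an invalid slot. */
uint32_t sk_fw_slot_addr(unsigned slot)
{
    if (slot >= SK_FW_SLOTS)
        return 0u;
    return SK_SHARED_BASE + slot * SK_FW_SLOT_SIZE;
}
```

With these definitions the end of slot f3 coincides with the start of the EdkDSP data area, which is a quick consistency check of the layout.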

The Linux user application uses the four reserved 4 kByte areas to hold copies of four PicoBlaze6 firmware programs. These programs can be compiled on the dual core ARM A9 from C and ASM source code stored as ASCII files on the mounted SD card file system. The user application running on ARM reads the compiled firmware programs from the SD card files and copies them as data to the reserved 4 kByte contiguous memory areas. The MicroBlaze program (after HW mutex based synchronisation) reads this data and uses it to program the PicoBlaze6 FSM of the 8xSIMD EdkDSP IP.

# 6. Setup of Hardware

HW setup is based on components [1], [2], [3], [4], [5] designed and manufactured by company Trenz Electronic:

- **TE0720-03-2IF**; Part: XC7Z020-2CLG484I; 1 GByte DDR; Industrial Grade (Tj = -40°C to +100°C) [1].
- **TE0720-03-1QF**; Part: XA7Z020-1CLG484Q; 1 GByte DDR; Automotive Grade (Tj = -40°C to +125°C) [1].
- **TE0720-03-14S-1C**; Part: XC7Z014S-1CLG484C; 1 GByte DDR; Commercial Grade (Tj = 0°C to +85°C) [1].
- **Heatsink for TE0720**, spring-loaded embedded [2]. The heatsink provides passive cooling of the Zynq module.
- **TE0706-02 Carrier Board** from Trenz Electronic [3]. The board targets extension with a second Ethernet in the Zynq PL.
- **TE0703-05 Carrier Board** from Trenz Electronic [3]. The board targets wide I/O with pre-processing in a Lattice FPGA.
- **Pmod USBUART** serial converter & interface [4]. Serves for output from MicroBlaze to the PC console via PC USB.
- **TE0790-02 XMOD FTDI JTAG Adapter** - Xilinx compatible [5]. Supports console and JTAG in case of TE0706-02.

The technical reference manuals (TRM) of the TE0720-03-2IF, TE0720-03-1QF and TE0720-03-14S-1C modules can be downloaded from [1], and the TRM for the carrier board TE0706-02 or TE0703-05 can be downloaded from [3].



Figure 5: TE0706-02; TE0720-03-14S-1C; USBUART and XMOD FTDI JTAG adapter




#### Configuration of switches and jumpers on carrier boards TE0703-05 and TE0706-02

#### Configuration of TE0703-05 board

- Set jumpers of the TE0703-05 board to VCCIOA=3.3V; VCCIOB=1.8V; VCCIOC=3.3V; VCCIOD=3.3V by:
   J5: connect 2-3; J8: connect 1-2; J9: connect 2-3; J10: connect 2-3
- Set the configuration DIP switch **S2** of the **TE0703-05** board to:

1=OFF; 2=ON; 3=ON; 4=ON

#### Configuration of TE0706-02 board

- Set jumpers of the TE0706-02 board to generate VCCIOA=3.3V; VCCIOC=3.3V; VCCIOD=3.3V by J10: connect 2-3; J11: connect 2-3; J12: connect 2-3
   In case of the TE0706-02 board, VCCIOB=1.8V is set directly on the PCB (no dedicated jumper).
- Set switch S1 of the **TE0706-02** board to:

1=ON; 2=ON; 3=ON; 4=OFF

#### Configuration of TE0790-02 xmod adapter

The TE0706-02 board ARM serial terminal/JTAG is connected to the PC by a Mini USB (type B) cable via the **TE0790-02** XMOD FTDI JTAG adapter [5]. See *Figure 1* and *Figure 5*.

- Set switch in the XMOD module to:

1=ON; 2=OFF; 3=ON; 4=OFF

The jumper on the USBUART pmod is set to the default: connect **lcl-vcc**. With this setup, the USBUART pmod converter chip is powered from the PC 5V USB source. The TE0790-02 xmod adapter generates its local 3.3V power supply by an on-module DC-DC power converter. See *Figure 1* and *Figure 5*.

#### Configuration of USBUART pmod adapter

The serial terminal for MicroBlaze is connected to the PC by a Micro USB cable via the USBUART pmod adapter. The J6 connector on the TE0706-02 and J2 connector on the TE0703-05 have three lines of 32 pins named:

[A1 A2 A3 A4 A5 A6 ... A32] [B1 B2 B3 B4 B5 B6 ... B32] [C1 C2 C3 C4 C5 C6 ... C32]

In case of the TE0706-02 board, the USBUART pmod is connected to pins [B1 ... B6] of connector J6 (central line B). See *Table 9*, *Figure 6* and the implemented setup in *Figure 5*:

Table 9: Connection of USBUART to TE0706-02

| TE0706-02 | USBUART  | Name | Function                                                                 |
|-----------|----------|------|--------------------------------------------------------------------------|
| J6 pin B1 | J2 pin 6 | 3.3V | Disconnected by USBUART jumper. Power for USBUART from PC USB 5V         |
| J6 pin B2 | J2 pin 5 | GND  | Ground                                                                   |
| J6 pin B4 | J2 pin 3 | TXD  | FPGA Pin: AB2; FPGA design net: uart_pmod_tx; Direction: from PC to FPGA |
| J6 pin B5 | J2 pin 2 | RXD  | FPGA pin: U5; FPGA design net: uart_pmod_rx; Direction: from FPGA to PC  |

In case of the TE0703-05, the USBUART pmod can also be connected directly to pins [B1 ... B6] of the connector J2 if communication from the PC to MicroBlaze is not needed. If it is needed, use a custom cable. See *Table 10* and *Figure 7*.

Table 10: Connection of USBUART to TE0703-05

| TE0703-05        | USBUART  | Name | Function                                                                 |
|------------------|----------|------|--------------------------------------------------------------------------|
| J2 pin B1        | J2 pin 6 | 3.3V | Disconnected by USBUART jumper. Power for USBUART from PC USB 5V         |
| J2 pin B2        | J2 pin 5 | GND  | Ground                                                                   |
| J2 pin **C3**    | J2 pin 3 | TXD  | FPGA Pin: AB2; FPGA design net: uart_pmod_tx; Direction: from PC to FPGA |
| J2 pin B5        | J2 pin 2 | RXD  | FPGA pin: U5; FPGA design net: uart_pmod_rx; Direction: from FPGA to PC  |

1. 5V power connector jack, J1
2. Reset switch, S2
3. USB2.0 type A receptacle, J7
4. Micro SD card socket with Card Detect, J4
5. 50 pin IDC male connector, J5
6. 1000Base-T Gigabit RJ45 Ethernet MagJack, J3
7. 1000Base-T Gigabit RJ45 Ethernet MagJack, J2
8. XMOD JTAG- / UART-header, JX1
9. User DIP-switch, S1
10. VCCIO selection jumper block, J10 to J12
11. External connector (VG96) placeholder, J6
12. Samtec Razor Beam™ LSHM-150 B2B connector, JB1
13. Samtec Razor Beam™ LSHM-150 B2B connector, JB2
14. Samtec Razor Beam™ LSHM-130 B2B connector, JB3

Figure 6: TE0706-02 Carrier Board.

*Figure 6* presents the main components and connector locations of the TE0706-02 Carrier Board [3]. The evaluation package released together with this application note supports a single 1000Base-T Gigabit RJ45 Ethernet MagJack, J3, as the ARM A9 PetaLinux eth0. The output path from MicroBlaze to the PC and the input path from the PC keyboard to MicroBlaze are supported by the USBUART connected directly to pins B1...B6 of the connector J6. See https://wiki.trenz-electronic.de/display/PD/TE0706+TRM for the source of the photo and for a detailed description of the TE0706-02 carrier board.





1. Samtec Razor Beam™ LSHM-150 B2B connector, JB1
2. Samtec Razor Beam™ LSHM-150 B2B connector, JB2
3. Samtec Razor Beam™ LSHM-130 B2B connector, JB3
4. Micro SD card socket with detect switch, J3
5. LED indicators D1 and D2
6. Mini-USB type B connector, J4
7. LED indicators D3 and D4
8. Configuration DIP switches, S2
9. User push button (Reset), S1
10. External connector (VG96) placeholder, J1
11. External connector (VG96) placeholder, J2
12. VCCIO voltage selection jumper block, J5, J8, J9 and J10
13. Trxcom 1000Base-T Gigabit RJ45 Magjack, J14
14. USB type A receptacle, J6 (optional micro USB 2.0 type B receptacle available, J12)
15. 5V power connector jack, J13

Figure 7: TE0703-05 Carrier Board.

*Figure 7* presents the main components and connector locations of the TE0703-05 Carrier Board [3]. The precompiled designs can be used without modification on the TE0703-05. The output path from MicroBlaze to the PC is supported if the USBUART is connected directly to the **J2: B1...B6** pins. The output path from MicroBlaze to the PC together with the input path from the PC keyboard to MicroBlaze is supported only if the USBUART is connected to the **J2: B1 B2 C3 B5** pins indirectly (via a custom made cable). See https://wiki.trenz-electronic.de/display/PD/TE0703+TRM for the source of the photo and for a description of the **TE0703-05** carrier board.




# 7. Reference Application for the 8xSIMD EdkDSP IP Core

The reference application is active acoustic noise cancellation for hands-free telephony.

The near end signal e(i) (voice of a speaker) is disturbed by a disturbance signal received by the near end microphone. This unknown disturbance y(i) is generated by a known (measured) far end signal u(i) (example: noise from an engine). The objective of active acoustic noise cancellation is to use the measured disturbed near end microphone signal d(i) and the signal u(i) measured by the far end microphone to reconstruct the near end speaker signal e(i) with the disturbance cancelled.

The transfer function from the far end (known) source of the disturbance is modelled by a recursive FIR filter with 2000 coefficients at a sampling rate of 75 kHz.

#### **Recursive FIR filter algorithm:**

The objective of the FIR filter is to generate a sequence of modelled system outputs d(i) from the sequence of system inputs u(i) and a constant vector of N FIR filter coefficients. The generated output sequence also includes random additive output noise defined by the white noise signal e(i).

```
x(i) = u(i)

y(i) = [w(1), w(2), ..., w(N)] * [x(i), x(i-1), ... x(i-N+1)]^T

d(i) = y(i) + e(i)
```
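A plain C reference for one step of the FIR equations above might look as follows. It is a scalar sketch for illustration only, not the 8xSIMD EdkDSP implementation. The function name `fir_step` and the use of a shifted delay line are assumptions; array index k corresponds to w(k+1) and x(i-k) in the notation above.

```c
#include <stddef.h>

/* One FIR step: shift the new input u into the delay line x[0..n-1]
 * (x[0] is the newest sample) and return y = sum over k of w[k] * x[k]. */
float fir_step(const float *w, float *x, size_t n, float u)
{
    if (n == 0)
        return 0.0f;

    /* Age the delay line by one sample and insert the new input. */
    for (size_t k = n - 1; k > 0; --k)
        x[k] = x[k - 1];
    x[0] = u;

    /* Dot product of the coefficient vector and the delay line. */
    float y = 0.0f;
    for (size_t k = 0; k < n; ++k)
        y += w[k] * x[k];
    return y;
}
```

The modelled output d(i) is then obtained by adding the noise sample e(i) to the returned y(i).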

#### **Recursive adaptive LMS filter algorithm:**

The objective of the adaptive LMS filter is to identify recursively an unknown vector of N=2000 FIR filter coefficients from a sequence of system inputs u(i) and system outputs d(i) at a sampling rate of 75 kHz. The algorithm works under the assumption that the measured output sequence d(i) has been generated by a FIR filter with N=2000 unknown coefficients and also includes an unknown random white noise signal. The signal e(i) is estimated by the adaptive LMS filter.

```
x(i) = u(i)

y(i) = [w(1), w(2), ..., w(N)] * [x(i), x(i-1), ... x(i-N+1)]^T

e(i) = d(i) - y(i)

[w(1), w(2), ..., w(N)] = [w(1), w(2), ..., w(N)] + mu * e(i) * [x(i), x(i-1), ... x(i-N+1)]
```

Where N is order of the FIR and LMS filter. N = 2000 in the implemented designs.

- u(i) is scalar, floating point input to the system
- d(i) is scalar, floating point output of a system
- y(i) is scalar, floating point output of FIR filter
- e(i) is scalar, floating point prediction error

[w(1), w(2), ..., w(N)] is vector of N scalar, floating point FIR filter coefficients, N=2000.

mu is scalar, floating point constant used for control of the speed of convergence of the adaptive LMS filter.
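One step of the LMS equations above can be written in plain C as sketched below. As with any scalar reference, this is an illustration and not the accelerated 8xSIMD EdkDSP code. The function name `lms_step` and the shifted delay line are assumptions; array index k corresponds to w(k+1) and x(i-k).

```c
#include <stddef.h>

/* One LMS step. x[0..n-1] is the delay line (x[0] is the newest sample),
 * w[0..n-1] is the current coefficient estimate, u and d are the new
 * input and output samples. Returns the prediction error e(i). */
float lms_step(float *w, float *x, size_t n, float u, float d, float mu)
{
    if (n == 0)
        return d;

    /* x(i) = u(i): age the delay line and insert the new input. */
    for (size_t k = n - 1; k > 0; --k)
        x[k] = x[k - 1];
    x[0] = u;

    /* y(i) = w * x^T */
    float y = 0.0f;
    for (size_t k = 0; k < n; ++k)
        y += w[k] * x[k];

    /* e(i) = d(i) - y(i) */
    float e = d - y;

    /* w = w + mu * e(i) * x */
    for (size_t k = 0; k < n; ++k)
        w[k] += mu * e * x[k];

    return e;
}
```

Each call performs one dot product and one coefficient update over all N coefficients, which is why the LMS filter needs about twice the arithmetic of the FIR filter per sample.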

#### The 8xSIMD EdkDSP IP Core

The 8xSIMD EdkDSP IP Core is configured for accelerated floating point computation of the recursive FIR filter with constant parameters N=2000 and for acceleration of the adaptive recursive LMS filter with N=2000 unknown coefficients with required sustained sampling frequency 75 kHz. The FIR filter models the environment and generates the sequence of u(i), d(i) data measurements.


The LMS filter reconstructs the unknown e(i) sequence, that is, the speaker's voice with the disturbance from a distant source partially cancelled. The requirements and the main implementation results (for the floating-point FIR and LMS filter implementation on the 8xSIMD EdkDSP IP) are listed in *Table 11*.

Table 11: Requirements and results.

| Parameter (Module TE0720-03-2IF)                                                 | Requirement     | SW MicroBlaze 100 MHz (requirement met: YES/NO)     | 8xSIMD EdkDSP 120 MHz (requirement met: YES/NO)     |
|----------------------------------------------------------------------------------|-----------------|-----------------------------------------------------|-----------------------------------------------------|
| FIR filter sampling rate, order N=2000                                           | 75 kHz          | 2.25 kHz (NO)                                       | 279.70 kHz (YES)                                    |
| FIR sustained performance (MFLOPs)                                               | 300 MFLOPs      | 9 MFLOPs (NO)                                       | 1119 MFLOPs (YES)                                   |
| LMS filter sampling rate, order N=2000                                           | 75 kHz          | 1.125 kHz (NO)                                      | 90.75 kHz (YES)                                     |
| LMS sustained performance (MFLOPs)                                               | 600 MFLOPs      | 9 MFLOPs (NO)                                       | 728 MFLOPs (YES)                                    |
| **Parameter (Modules TE0720-03-1QF, TE0720-03-14S-1C)**                          | **Requirement** | **SW MicroBlaze 100 MHz (requirement met: YES/NO)** | **8xSIMD EdkDSP 100 MHz (requirement met: YES/NO)** |
| FIR filter sampling rate, order N=2000                                           | 75 kHz          | 2.25 kHz (NO)                                       | 244.4 kHz (YES)                                     |
| FIR sustained performance (MFLOPs)                                               | 300 MFLOPs      | 9 MFLOPs (NO)                                       | 978 MFLOPs (YES)                                    |
| LMS filter sampling rate, order N=2000                                           | 75 kHz          | 1.125 kHz (NO)                                      | 77.03 kHz (YES)                                     |
| LMS sustained performance (MFLOPs)                                               | 600 MFLOPs      | 9 MFLOPs (NO)                                       | 618 MFLOPs (YES)                                    |
| Bit exact identical results for 8xSIMD EdkDSP IP and MB (FIR and LMS)            | Required        | YES                                                 | YES                                                 |
| Parallel EdkDSP computation and data transfers to/from DDR3 by MicroBlaze        | Required        | YES                                                 | YES                                                 |
| Runtime change of 8xSIMD EdkDSP IP                                               | Required        | NA                                                  | YES                                                 |
| Embedded 8xSIMD EdkDSP C compiler                                                | Required        | NA                                                  | YES                                                 |
| Compatibility with SDSoC 2017.4.1                                                | Required        | YES                                                 | YES                                                 |
| Compatibility with PetaLinux 2017.4.1                                            | Required        | YES                                                 | YES                                                 |
| Compatibility with free SDK 2017.4.1 and free edition of Vivado HLS 2017.4.1     | Required        | YES                                                 | YES                                                 |
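The relation between the sampling rates and the MFLOPs figures in Table 11 can be checked with a few lines of arithmetic. The sketch below assumes 2N floating-point operations per FIR sample (one multiply and one add per coefficient) and 4N per LMS sample (FIR part plus coefficient update). These operation counts are an assumption made here; they are consistent with the requirement rows (75 kHz corresponds to 300 and 600 MFLOPs for N=2000) but are not stated explicitly in the table. The helper names `mflops` and `speedup` are illustrative.

```python
N = 2000
FIR_FLOPS_PER_SAMPLE = 2 * N   # assumed: multiply + add per coefficient
LMS_FLOPS_PER_SAMPLE = 4 * N   # assumed: FIR part + coefficient update


def mflops(rate_khz, flops_per_sample):
    """Sustained MFLOPs implied by a sampling rate given in kHz."""
    return rate_khz * 1e3 * flops_per_sample / 1e6


def speedup(accel_rate_khz, sw_rate_khz):
    """Ratio of accelerated to software-only sampling rate."""
    return accel_rate_khz / sw_rate_khz


# (label, flops per sample, MicroBlaze rate in kHz, EdkDSP rate in kHz), from Table 11
rows = [
    ("FIR, 2IF, 120 MHz", FIR_FLOPS_PER_SAMPLE, 2.25, 279.70),
    ("LMS, 2IF, 120 MHz", LMS_FLOPS_PER_SAMPLE, 1.125, 90.75),
    ("FIR, 1QF/14S, 100 MHz", FIR_FLOPS_PER_SAMPLE, 2.25, 244.4),
    ("LMS, 1QF/14S, 100 MHz", LMS_FLOPS_PER_SAMPLE, 1.125, 77.03),
]

for label, flops, sw_khz, hw_khz in rows:
    print(f"{label}: {mflops(hw_khz, flops):.1f} MFLOPs, "
          f"{speedup(hw_khz, sw_khz):.1f}x faster than MicroBlaze")
```

Under these assumptions the FIR rows reproduce the table (1118.8 and 977.6 MFLOPs, reported as 1119 and 978), while the LMS rows come out at 726 and about 616 MFLOPs, slightly below the reported 728 and 618 MFLOPs, so the operation count behind the LMS figures is presumably a little higher than 4N.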

#### Summary of main results related to the performance of the 8xSIMD EdkDSP IP:

- The required LMS filter sampling rate 75 kHz (with N=2000) is reached for the TE0720-03-2IF module.
- The maximum sampling rate is 90.75 kHz for the adaptive LMS filter and 279.7 kHz for the FIR filter on the TE0720-03-2IF module with the 120 MHz 8xSIMD EdkDSP.
- The sustained floating-point performance of the 120 MHz 8xSIMD EdkDSP on TE0720-03-2IF module is 728 MFLOPs in case of the adaptive LMS filter and 1119 MFLOPs in case of the FIR filter.
- The maximum sampling rate is 77.03 kHz for the adaptive LMS filter and 244.4 kHz for the FIR filter on the TE0720-03-1QF or TE0720-03-14S-1C module with the 100 MHz 8xSIMD EdkDSP.
- The sustained floating-point performance of the 100 MHz 8xSIMD EdkDSP on TE0720-03-1QF or TE0720-03-14S-1C module is 618 MFLOPs in case of the adaptive LMS filter and 978 MFLOPs in case of the FIR filter.
- The 8xSIMD EdkDSP is controlled from the 100 MHz MicroBlaze processor and operates in parallel to the Cortex A9 processor(s).
- The 8xSIMD EdkDSP operates in parallel to each of the 21 Linux examples and 19 standalone examples
  of HW accelerators generated from selected Cortex A9 C/C++ functions in the Xilinx SDSoC 2017.4.1
  design environment.
- The embedded C/ASM compiler utilities for the 8xSIMD EdkDSP accelerator run as Linux applications on the Arm Cortex A9 processor. These utilities can re-compile new EdkDSP firmware from modified C/ASM source code at runtime.



# 8. Installation and Use of Base Evaluation Package

This chapter describes the installation and use of the base evaluation package. The package demonstrates:

- In-circuit Logic Analyser (ILA) JTAG-based inspection/observation/debug of the 8xSIMD EdkDSP IP. The ILA uses an internal buffer for 8k samples and operates at 100 MHz (1qf and 14s devices) and 120 MHz (2if device). See *Figure 9*, *Figure 10*, *Figure 11*, *Figure 12*.
- The standalone examples support ILA and can additionally display the on-chip temperature via JTAG. See *Figure 13*.
- Embedded compilation of C/ASM source code into firmware for the reprogrammable PicoBlaze6 finite state machine (FSM). Inside the 8xSIMD EdkDSP IP core, this FSM schedules the floating-point computation sequences performed in the 8xSIMD data flow unit (DFU). This embedded compilation is supported for the Linux examples. See *Figure 14*, *Figure 15*, *Figure 16*.
- There is no need to install Xilinx SDK 2017.4.1, Xilinx Vivado 2017.4.1 tools or Xilinx SDSoC 2017.4.1.
- The In-circuit Logic Analyser (ILA) JTAG-based inspection/observation/debug can be performed from the free Xilinx Lab Vivado 2017.4.1 tool installed on a Win7 (64bit) or Win 10 (64bit) PC.
- The Linux target examples support 1 Gbit Ethernet, SSH telnet and file system management tools such as Total Commander for FTP-based access from the PC to the SD card files.

The base evaluation package provides 21 demos for the Linux target and 19 precompiled demos for the standalone target. *Table 12* describes the demos, the PL resources and the HW/SW SDSoC 2017.4.1 acceleration data.





Table 12: Description of ARM SDSoC acceleration examples compatible with 8xSIMD EdkDSP IP

| Linux HW/SW Acceleration | Standalone HW/SW Acceleration | Description of ARM SDSoC acceleration examples. All examples are extended versions of the Xilinx GitHub SDSoC 2017.1 examples. SW extensions support the initialisation of the MicroBlaze processor and the 8xSIMD EdkDSP IP core. |
|--------------|--------------|-------------------------------------------------------------------------------------------------------------------|
| te01_l       | te01_s       | array_partition - This example shows how to use array partitioning to improve                                     |
|              |              | performance of a hardware function. It performs int32 matrix multiplication                                       |
|              |              | C[32,32]= A[32,32]*B[32,32]                                                                                       |
| 2if: 3.39x   | 2if: 6.62x   | 150 MHz Slices: 63.20% Luts: 44.36% Registers: 23.69% BRAMs: 76.79% DSPs: 54.55%                                  |
| 1qf: 4.40x   | 1qf: 7.29x   | 150 MHz Slices: 65.14% Luts: 43.27% Registers: 26.00% BRAMs: 76.79% DSPs: 54.55%                                  |
| 14s: 4.46x   | 14s: 7.17x   | 150 MHz Slices: 63.71% Luts: 56.60% Registers: 34.05% BRAMs: 89.25% DSPs: 70.59%                                  |
| te02_l       | te02_s       | <b>burst_rw</b> - This is simple example of using AXI4-master interface for burst read and write.                 |
| 2if:         | 2if:         | 150 MHz Slices: 56.80% Luts: 38.86% Registers: 21.14% BRAMs: 51.43% DSPs: 9.55%                                   |
| 1qf:         | 1qf:         | 150 MHz Slices: 55.72% Luts: 38.89% Registers: 21.14% BRAMs: 51.43% DSPs: 9.55%                                   |
| 14s:         | 14s:         | 150 MHz Slices: 54.65% Luts: 50.87% Registers: 27.67% BRAMs: 56.07% DSPs: 12.35%                                  |
| te03_l       | te03_s       | <b>custom_data_type</b> - This is a simple example of RGB to HSV conversion to demonstrate                        |
|              |              | Custom Data Type usage in hardware accelerator. Xilinx HLS compiler supports custom                               |
|              |              | data type to operate within the hardware function and also it acts as a memory                                    |
|              |              | interface between PL to DDR3.                                                                                     |
| 2if: 22.48x  | 2if: 25.16x  | 150 MHz Slices: 60.69% Luts: 42.18% Registers: 22.93% BRAMs: 51.43% DSPs: 10.91%                                  |
| 1qf: 25.43x  | 1qf: 28.94x  | 150 MHz Slices: 59.81% Luts: 42.21% Registers: 23.07% BRAMs: 51.43% DSPs: 10.91%                                  |
| 14s: 25.88x  | 14s: 28.88x  | 150 MHz Slices: 59.32% Luts: 55.23% Registers: 30.07% BRAMs: 56.07% DSPs: 14.12%                                  |
| te04_l       | te04_s       | data_access_random - This is a simple example of int32 matrix multiplication                                      |
|              |              | (Row x Col) C[32,32]= A[32,32]*B[32,32] to demonstrate random data access pattern.                                |
| 2if: 0.57x   | 2if: 0.57x   | 120 MHz Slices: 65.63% Luts: 43.55% Registers: 25.34% BRAMs: 56.43% DSPs: 13.64%                                  |
| 1qf: 0.63x   | 1qf: 0.63x   | 120 MHz Slices: 64.33% Luts: 43.58% Registers: 25.35% BRAMs: 56.43% DSPs: 13.64%                                  |
| 14s: 0.63x   | 14s: 0.63x   | 120 MHz Slices: 64.60% Luts: 57.02% Registers: 33.19% BRAMs: 62.62% DSPs: 17.65%                                  |
| te05_l       | te05_s       | dependence_inter - This is a simple example to demonstrate inter dependence                                       |
|              |              | attribute. Using inter dependence attribute user can provide additional dependency                                |
|              |              | details to compiler which allow compiler to perform unrolling/pipelining to get better                            |
|              |              | performance.                                                                                                      |
| 2if: 5.84x   | 2if: 6.51x   | 150 MHz Slices: 58.66% Luts: 40.36% Registers: 22.57% BRAMs: 55.00% DSPs: 22.27%                                  |
| 1qf: 6.42x   | 1qf: 7.16x   | 150 MHz Slices: 59.05% Luts: 40.30% Registers: 22.80% BRAMs: 55.00% DSPs: 22.27%                                  |
| 14s: 6.60x   | 14s: 7.22x   | 150 MHz Slices: 58.81% Luts: 52.72% Registers: 29.85% BRAMs: 60.75% DSPs: 28.82%                                  |
| te06_l       | te06_s       | direct_connect - This is a simple example of int32 matrix multiplication with matrix                              |
|              |              | addition (Out[50,50] = ( A[50,50] * B[50,50] ) + C[50,50] ) to demonstrate direct                                 |
|              |              | connection which helps to achieve increasing in system parallelism and concurrency.                               |
| 2if: 8.61x   | 2if: 9.14x   | 150 MHz Slices: 75.00% Luts: 49.24% Registers: 29.73% BRAMs: 82.50% DSPs: 57.73%                                  |
| 1qf: 8.36x   | 1qf: 8.92x   | 120 MHz Slices: 72.65% Luts: 49.21% Registers: 29.73% BRAMs: 82.50% DSPs: 57.73%                                  |
| 14s: 9.55x   | 14s: 9.92x   | 150 MHz Slices: 74.31% Luts: 62.99% Registers: 40.41% BRAMs: 96.73% DSPs: 74.71%                                  |
| te07_l       | te07_s       | dma_sg - This example demonstrates how to use Scatter-Gather DMAs for data transfer to/from hardware accelerator. |
| 2if:         | 2if:         | 150 MHz Slices: 73.83% Luts: 48.92% Registers: 29.41% BRAMs: 60.00% DSPs: 9.55%                                   |
| 1qf:         | 1qf:         | 120 MHz Slices: 72.84% Luts: 48.94% Registers: 29.41% BRAMs: 60.00% DSPs: 9.55%                                   |
| 14s:         | 14s:         | 150 MHz Slices: 74.14% Luts: 64.08% Registers: 38.59% BRAMs: 67.29% DSPs: 12.35%                                  |
| te08_l       | te08_s       | dma_simple - This example demonstrates how to insert Simple DMAs for data transfer                                |
|              |              | between user program and hardware accelerator.                                                                    |
| 2if:         | 2if:         | 150 MHz Slices: 63.16% Luts: 43.06% Registers: 24.72% BRAMs: 56.43% DSPs: 9.55%                                   |
| 1qf:         | 1qf:         | 150 MHz Slices: 64.74% Luts: 43.07% Registers: 24.78% BRAMs: 56.43% DSPs: 9.55%                                   |
| 14s:         | 14s:         | 150 MHz Slices: 62.58% Luts: 56.34% Registers: 32.44% BRAMs: 62.62% DSPs: 12.35%                                  |
| te09_l<br>(With Linux<br>SD file R/W<br>functions) | Not implemented<br>as standalone | <b>file_io_manr_sobel</b> - Linux video processing application that reads input video from a file and writes out the output video to a file. Video processing includes Motion Adaptive Noise Reduction (MANR) followed by a Sobel filter for edge detection. You can run it by supplying a 1080p YUV422 file as input with limiting number of frames to a maximum of 20 frames. |  |
| 2if:                                               | NA                                    | 120 MHz Slices: 75.92% Luts: 51.33% Registers: 30.58% BRAMs: 63.21% DSPs: 10.91%                                                                                                                                                                                                                                                                                                |  |
| 1qf:                                               | NA                                    | 120 MHz Slices: 75.98% Luts: 51.32% Registers: 30.66% BRAMs: 63.21% DSPs: 10.91%                                                                                                                                                                                                                                                                                                |  |
| 14s:                                               | NA                                    | 120 MHz Slices: 76.30% Luts: 67.15% Registers: 40.15% BRAMs: 71.50% DSPs: 12.62%                                                                                                                                                                                                                                                                                                |  |
| te10_l<br>(With Linux<br>SD file R/W<br>functions) | Not implemented<br>as standalone | file_io_optical - Linux video processing application that reads input video from a file and writes out the output video to a file. Video processing performs LK Dense Optical Flow over two Full HD frames video file. You can run it by supplying a 1080p YUV422 file route85_1920x1080.yuv as input. |  |
| 2if:                                               | NA                                    | 120 MHz Slices: 99.50% Luts: 79.55% Registers: 51.34% BRAMs: 88.93% DSPs: 40.91%                                                                                                                                                                                                                                                                                                |  |
| 1qf:                                               | NA                                    | 100 MHz Slices: 98.89% Luts: 79.62% Registers: 50.90% BRAMs: 88.93% DSPs: 40.91%                                                                                                                                                                                                                                                                                                |  |
| 14s:                                               | NA                                    | SW impl. Slices: 43.88% Luts: 40.25% Registers: 19.66% BRAMs: 50.00% DSPs: 12.35%                                                                                                                                                                                                                                                                                               |  |
| te11 l                                             | te11 s                                | full_array_2d - This is a simple example of accessing full data from 2D array.                                                                                                                                                                                                                                                                                                  |  |
| 2if:                                               | 2if:                                  | 150 MHz Slices: 60.20% Luts: 41.98% Registers: 23.00% BRAMs: 55.36% DSPs: 12.27%                                                                                                                                                                                                                                                                                                |  |
| 1qf:                                               | 1qf:                                  | 150 MHz Slices: 59.52% Luts: 41.91% Registers: 23.09% BRAMs: 55.36% DSPs: 12.27%                                                                                                                                                                                                                                                                                                |  |
| 14s:                                               | 14s:                                  | 150 MHz Slices: 59.86% Luts: 54.84% Registers: 30.23% BRAMs: 61.21% DSPs: 15.88%                                                                                                                                                                                                                                                                                                |  |
| te12_l                                             | te12_s                                | hello_vadd - This is a basic hello world kind of example which demonstrates how to                                                                                                                                                                                                                                                                                              |  |
|                                                    |                                       | achieve vector addition using hardware function.                                                                                                                                                                                                                                                                                                                                |  |
| 2if:                                               | 2if:                                  | 150 MHz Slices: 60.06% Luts: 41.46% Registers: 22.59% BRAMs: 53.21% DSPs: 9.55%                                                                                                                                                                                                                                                                                                 |  |
| 1qf:                                               | 1qf:                                  | 150 MHz Slices: 58.95% Luts: 41.48% Registers: 22.59% BRAMs: 53.21% DSPs: 9.55%                                                                                                                                                                                                                                                                                                 |  |
| 14s:                                               | 14s:                                  | 150 MHz Slices: 57.40% Luts: 54.29% Registers: 29.57% BRAMs: 58.41% DSPs: 12.35%                                                                                                                                                                                                                                                                                                |  |
| te13_l                                             | te13 s                                | Imem_2rw - This is a simple example of vector addition to demonstrate how to utilize                                                                                                                                                                                                                                                                                            |  |
| 1015_1                                             | 1015_5                                | both ports of Local Memory.                                                                                                                                                                                                                                                                                                                                                     |  |
| 2if:                                               | 2if:                                  | 150 MHz Slices: 61.30% Luts: 42.13% Registers: 23.02% BRAMs: 55.36% DSPs: 9.55%                                                                                                                                                                                                                                                                                                 |  |
| 1qf:                                               | 1qf:                                  | 150 MHz Slices: 61.26% Luts: 42.14% Registers: 23.02% BRAMs: 55.36% DSPs: 9.55%                                                                                                                                                                                                                                                                                                 |  |
| 14s:                                               | 14s:                                  | 150 MHz Slices: 59.48% Luts: 55.12% Registers: 30.13% BRAMs: 61.21% DSPs: 12.35%                                                                                                                                                                                                                                                                                                |  |
| te14_l                                             | te14 s                                | loop_fusion - This example will demonstrate how to fuse two loops into one to improve                                                                                                                                                                                                                                                                                           |  |
|                                                    |                                       | the performance of a C/C++ hardware function.                                                                                                                                                                                                                                                                                                                                   |  |
| 2if:                                               | 2if:                                  | 150 MHz Slices: 61.41% Luts: 42.73% Registers: 23.57% BRAMs: 53.21% DSPs: 15.00%                                                                                                                                                                                                                                                                                                |  |
| 1qf:                                               | 1qf:                                  | 150 MHz Slices: 60.62% Luts: 42.72% Registers: 23.79% BRAMs: 53.21% DSPs: 15.00%                                                                                                                                                                                                                                                                                                |  |
| 14s:                                               | 14s:                                  | 150 MHz Slices: 60.64% Luts: 55.86% Registers: 31.14% BRAMs: 58.41% DSPs: 19.41%                                                                                                                                                                                                                                                                                                |  |
| te15_l                                             | te15_s                                | loop_perfect - This nearest neighbor example is to demonstrate how to achieve better                                                                                                                                                                                                                                                                                            |  |
|                                                    | 1020_0                                | performance using perfect loop.                                                                                                                                                                                                                                                                                                                                                 |  |
| 2if:                                               | 2if:                                  | 150 MHz Slices: 75.26% Luts: 53.49% Registers: 29.28% BRAMs: 53.21% DSPs: 15.45%                                                                                                                                                                                                                                                                                                |  |
| 1qf:                                               | 1qf:                                  | 150 MHz Slices: 73.72% Luts: 53.43% Registers: 29.51% BRAMs: 53.21% DSPs: 15.45%                                                                                                                                                                                                                                                                                                |  |
| 14s:                                               | 14s:                                  | 150 MHz Slices: 74.11% Luts: 69.93% Registers: 38.63% BRAMs: 58.41% DSPs: 20.00%                                                                                                                                                                                                                                                                                                |  |
| te16_l                                             | te16 s                                | loop_pipeline - This example demonstrates how loop pipelining can be used to improve                                                                                                                                                                                                                                                                                            |  |
| _                                                  | _                                     | the performance of a hardware function.                                                                                                                                                                                                                                                                                                                                         |  |
| 2if:                                               | 2if:                                  | 150 MHz Slices: 60.06% Luts: 41.46% Registers: 22.59% BRAMs: 53.21% DSPs: 9.55%                                                                                                                                                                                                                                                                                                 |  |
| 1qf:                                               | 1qf:                                  | 150 MHz Slices: 58.95% Luts: 41.48% Registers: 22.59% BRAMs: 53.21% DSPs: 9.55%                                                                                                                                                                                                                                                                                                 |  |
| 14s:                                               | 14s:                                  | 150 MHz Slices: 57.40% Luts: 54.29% Registers: 29.57% BRAMs: 58.41% DSPs: 12.35%                                                                                                                                                                                                                                                                                                |  |
| te17_l | te17_s | loop_reorder - This is a simple example of matrix multiplication (Row x Col) to demonstrate how to achieve better pipeline II factor by loop reordering. |  |
| 2if: 4.27x                                         | 2if: 7.12x                            | 150 MHz Slices: 68.44% Luts: 45.64% Registers: 26.45% BRAMs: 76.79% DSPs: 56.36%                                                                                                                                                                                                                                                                                                |  |
| 1qf: 4.66x                                         | 1qf: 7.72x                            | 150 MHz Slices: 67.90% Luts: 44.64% Registers: 27.53% BRAMs: 76.79% DSPs: 56.36%                                                                                                                                                                                                                                                                                                |  |
| 14s: 4.92x                                         | 14s: 7.85x                            | 150 MHz Slices: 66.91% Luts: 58.41% Registers: 36.04% BRAMs: 89.25% DSPs: 72.94%                                                                                                                                                                                                                                                                                                |  |



| te18_l      | te18_s      | shift_register - This example demonstrates how to shift values in each clock cycle. |
|-------------|-------------|-------------------------------------------------------------------------------------|
| 2if: 1.96x  | 2if: 4.19x  | 150 MHz Slices: 63.03% Luts: 42.68% Registers: 24.19% BRAMs: 53.21% DSPs: 24.55%    |
| 1qf: 2.02x  | 1qf: 4.54x  | 150 MHz Slices: 62.23% Luts: 42.40% Registers: 24.52% BRAMs: 53.21% DSPs: 24.55%    |
| 14s: 2.10x  | 14s: 4.52x  | 150 MHz Slices: 61.28% Luts: 55.41% Registers: 32.10% BRAMs: 58.41% DSPs: 31.76%    |
| te19_l      | te19_s      | sys_port - This is a simple example which demonstrates sys_port usage.              |
| 2if:        | 2if:        | 120 MHz Slices: 83.92% Luts: 54.55% Registers: 34.77% BRAMs: 65.00% DSPs: 9.55%     |
| 1qf:        | 1qf:        | 120 MHz Slices: 80.92% Luts: 54.53% Registers: 34.77% BRAMs: 65.00% DSPs: 9.55%     |
| 14s:        | 14s:        | 120 MHz Slices: 81.75% Luts: 71.42% Registers: 45.53% BRAMs: 73.86% DSPs: 12.35%    |
| te20_l      | te20_s      | systolic_array - Matrix multiplication implemented as systolic array.               |
| 2if: 0.066x | 2if: 0.162x | 150 MHz Slices: 68.55% Luts: 47.36% Registers: 26.67% BRAMs: 53.21% DSPs: 61.36%    |
| 1qf: 0.077x | 1qf: 0.177x | 150 MHz Slices: 66.75% Luts: 46.26% Registers: 27.82% BRAMs: 53.21% DSPs: 61.36%    |
| 14s: 0.068x | 14s: 0.198x | 150 MHz Slices: 67.15% Luts: 60.51% Registers: 36.42% BRAMs: 58.41% DSPs: 79.41%    |
| te21_l      | te21_s      | wide_memory_rw - Wide memory read write 64 bit wide.                                |
| 2if:        | 2if:        | 150 MHz Slices: 60.07% Luts: 39.74% Registers: 23.34% BRAMs: 55.36% DSPs: 9.55%     |
| 1qf:        | 1qf:        | 150 MHz Slices: 59.34% Luts: 39.77% Registers: 23.34% BRAMs: 55.36% DSPs: 9.55%     |
| 14s:        | 14s:        | 150 MHz Slices: 58.41% Luts: 52.05% Registers: 30.55% BRAMs: 61.21% DSPs: 12.35%    |



#### Installation and use of the Release Evaluation Package - standalone examples

In case of standalone target:

- (1) In Win 7 or Win 10 (32bit or 64bit PC), unzip the basic evaluation package TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL.zip to a directory of your choice. We will use: c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\
- (2) Select one of the examples (t01\_s ... t21\_s) and copy the content of its sd\_card directory to the SD card. Example: Copy BOOT.bin from c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_s\Release\sd\_card\BOOT.bin to the root of the SD card as a single file.
- (3) Connect a USB cable from the J7 connector to the PC. It will serve as the ARM terminal and the JTAG line.
- (4) Connect another USB cable from the USBUART pmod module in the J5 connector to the PC. It will serve as the MicroBlaze terminal.
- (5) Power ON the carrier board and open a putty (or similar) terminal client for both USB serial lines. Set the serial communication to: [speed 115200, data bits 8, stop bits 1, parity none and flow control None] in both cases.
- (6) Insert the SD card into the TE0706-02 or TE0703-05 carrier board.
- (7) Reset the carrier board (S2 button).
  - The standalone system will start. See *Figure 8*.
  - The ARM terminal will present output from the t01\_s example.
  - The MicroBlaze terminal will present output from the 8xSIMD EdkDSP IP. See Figure 8.
- (8) On the PC, open the Vivado Lab tool 2017.4.1. See Figure 9.
  - Open Hardware Manager.
  - Press the Auto Connect icon in the Hardware window.
  - Open the description of the debug nets by specifying the probes file:
    c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_s\Release\debug\_nets.ltx
  - Set the ILA trigger conditions and observe the process of computation in the 8xSIMD EdkDSP IP. See *Figure 10*, *Figure 11*, *Figure 12*.
  - Open a new perspective and observe the chip temperature. See Figure 13.
- (9) Close Vivado Lab 2017.4.1 tool project.
- (10) Remove SD card and reprogram it in PC to test another example.
- (11) Go to step (6).



```
SoM: TE0720-03-2I F SC REV:05
MAC: D8 80 39 DE 68 E0
ARMCPU0: MB0 reset removed, ARM waiting ...
ARMCPU0: MB0 indicates - running ...
Number of CPU cycles running application in software: 164000
Number of CPU cycles running application in hardware: 24760
Speed up: 6.62359
Note: Speed up is meaningful for real hardware execution only, not for emulation.
TEST PASSED
```

```
COM57 - PuTTY
MB0 : (EdkDSP 8xSIMD) FIR room response ... 1120 MFLOPs
MB0 : (HW FP unit   ) Add near-end signal ...
MB0 : (EdkDSP 8xSIMD) LMS Identification ... 728 MFLOPs
MB0 : (HW FP unit   ) LMS Identification ... 9 MFLOPs
MB0 : (EdkDSP 8xSIMD) OK
MB0 : (EdkDSP 8xSIMD) Write firmware ...
MB0 : (EdkDSP 8xSIMD) Capabilities1 = 13ffff
MB0 : (EdkDSP 8xSIMD) VZ2A 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) VB2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VZ2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VA2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VADD 'worker1'
MB0 : (EdkDSP 8xSIMD) VADD BZ2A 'worker1' .. OK
MB0 : (EdkDSP 8xSIMD) VADD AZ2B 'worker1' .. OK
MB0 : (EdkDSP 8xSIMD) VSUB 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) VSUB BZ2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VSUB AZ2B 'worker1' .. OK
MB0 : (EdkDSP 8xSIMD) VMULT 'worker1' .....
MB0 : (EdkDSP 8xSIMD) VMULT BZ2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VMULT AZ2B 'worker1' . OK
MB0 : (EdkDSP 8xSIMD) VPROD 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) VMAC 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) VMSUBAC 'worker1' .... OK
MB0 : (EdkDSP 8xSIMD) VPROD S8 'worker1' ... OK
MB0 : (EdkDSP 8xSIMD) VDIV 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) Write firmware ...
MB0 : (EdkDSP 8xSIMD) Capabilities1 = 13ffff
MB0 : (HW FP unit   ) Far-end signal ...
MB0 : (EdkDSP 8xSIMD) FIR room response ... 1120 MFLOPs
MB0 : (HW FP unit   ) Add near-end signal ...
MB0 : (EdkDSP 8xSIMD) LMS Identification ... 728 MFLOPs
MB0 :
```

Figure 8: Release demo t01\_s. ARM and 8xSIMD EdkDSP terminal output.





Figure 9: Release demo t01\_s. Vivado Lab Tool is open.

The Vivado Lab tool is connected to the chip. You have to specify the probes file (see Figure 10):

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_s\Release\debug\_nets.ltx

The names and parameters of the probes are added to the ILA Waveform window. See Figure 10.

Use + to select the probes used for triggering, and select the trigger condition for each probe and for their combination (use AND as the default).

Some of the debug probes can be used to trigger the data capture. The ILA can be triggered from the EdkDSP firmware running on the PicoBlaze6 inside the (8xSIMD) EdkDSP unit.






Figure 10: Release demo t01\_s. Probes file is specified. Trigger conditions are set.

In Xilinx SDK 2017.4.1, open the EdkDSP C source file: c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SDK\_Workspace\edkdsp\a\f2.c

See the **LMS** section of the C firmware code. It includes an additional call to the **pb2dfu\_set()** function, which is used to trigger the ILA scope selectively at a specified point of the computation in the EdkDSP accelerator.

```
pb2dfu_set(0x20, 0); // trigger (0x00 on port 0x20) for the ILA
for (i = 0; i < 4; i++) {
    for (j = 2; j <= 3; j++) {
        lms(j, n, op);
        pb2mb_eoc(led);
    }
}
```





Figure 11: Release demo t01 s. Details of the 8xSIMD EdkDSP LMS filter computation.

In Vivado Lab Tool 2017.4.1, in the ILA configuration page, change the trigger condition to:

(bce\_port\_wr ==1) AND (probe10[0:7] ==0x20) AND (probe9[0:7] ==0x00)

(bce\_port\_wr ==1) AND (bce\_port\_id[0:7] ==0x20) AND (bce\_port[0:7] ==0x00)

The first line is the selection as entered for the System ILA probes; the second line shows the corresponding EdkDSP signals. See the connections of the EdkDSP and the System ILA in Figure 3.
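The trigger condition above can be read as a predicate over three sampled signals. The following C sketch is only an illustrative host-side model and is not part of the evaluation package. The type **bce\_sample\_t** and the functions **ila\_trigger\_match()** and **ila\_first\_trigger()** are hypothetical names; only the signal names bce\_port\_wr, bce\_port\_id and bce\_port come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one System ILA sample of the EdkDSP port signals. */
typedef struct {
    bool    bce_port_wr;  /* write strobe (as the signal name suggests) */
    uint8_t bce_port_id;  /* port address, seen by the ILA as probe10[0:7] */
    uint8_t bce_port;     /* port data, seen by the ILA as probe9[0:7] */
} bce_sample_t;

/* True when (bce_port_wr == 1) AND (bce_port_id == id) AND (bce_port == data). */
bool ila_trigger_match(bce_sample_t s, uint8_t id, uint8_t data)
{
    return s.bce_port_wr && s.bce_port_id == id && s.bce_port == data;
}

/* Index of the first sample that satisfies the condition, or -1 if none does. */
int ila_first_trigger(const bce_sample_t *trace, int n, uint8_t id, uint8_t data)
{
    assert(trace != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        if (ila_trigger_match(trace[i], id, data)) {
            return i;
        }
    }
    return -1;
}
```

In this model, calling ila\_first\_trigger(trace, n, 0x20, 0x00) corresponds to the LMS trigger, and ila\_first\_trigger(trace, n, 0x20, 0x01) to the FIR trigger described later in this section.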

In Vivado Lab Edition 2017.4.1, arm the System ILA core by pressing the **Run Trigger** button in the **Hardware** window. The armed System ILA core will wait until the recompiled EdkDSP firmware reaches the point where the PicoBlaze6 calls the function **pb2dfu\_set(0x20, 0)**.

In case of TE0720-03-2IF, ILA captures 8K samples of all debug probes at 120 MHz. In case of TE0720-03-1QF, ILA captures 8K samples of all debug probes at 100 MHz. In case of TE0720-03-14S-1C, ILA captures 2K samples of all debug probes at 100 MHz.
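The capture depth and sampling clock together determine how long a time window one ILA snapshot covers. The short C sketch below works this out under two assumptions that the text does not state explicitly: "8K" and "2K" mean 8192 and 2048 samples, and the ILA stores one sample per clock cycle. The function name **ila\_window\_us()** is hypothetical.

```c
#include <assert.h>

/* Time window in microseconds covered by `samples` ILA samples taken at
 * `clock_mhz` MHz, assuming one sample per clock cycle.
 * samples / (cycles per microsecond) = microseconds. */
double ila_window_us(unsigned samples, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (double)samples / clock_mhz;
}
```

Under these assumptions, ila\_window\_us(8192, 120.0) gives about 68.3 µs for the TE0720-03-2IF, ila\_window\_us(8192, 100.0) gives 81.92 µs for the TE0720-03-1QF, and ila\_window\_us(2048, 100.0) gives 20.48 µs for the TE0720-03-14S-1C.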

The data are captured and sent over the JTAG USB connection to Vivado Lab Edition 2017.4.1 for visualisation and analysis in the waveform window. This snapshot stores the detailed trace of the FIR filter computation. See *Figure 12*.





Figure 12: Release demo t01\_s. Details of the 8xSIMD EdkDSP FIR filter computation.

In the Vivado Lab Tool, in the ILA configuration page, change the trigger condition to (probe9[0:7]==0x01). This corresponds to the condition bce\_port[0:7]==0x01. See the connections in *Figure 3*. The ILA will capture the start of the FIR filter. See *Figure 12*. The PicoBlaze C code of the FIR example is listed in *Figure 19*.

The Vivado Lab screens presented in *Figure 11* and *Figure 12* also display the 1024 samples captured before the trigger event. This mode is set in the trigger mode settings window. The screens show how the PicoBlaze6 controller reset signal **bce\_r\_pb** is deactivated. The PicoBlaze6 reads the 8-bit parameters **op** and **n** from the MicroBlaze before the trigger event. See the complete program listing in *Figure 19*, which starts with these lines of the PicoBlaze6 SW:

```
pb2dfu_set(0x20, 1); // trigger (0x01 on port 0x20) for the ILA
for (i = 0; i < 4; i++) {
    for (j = 2; j <= 3; j++) {
        fir(j, n, op);
        pb2mb_eoc(led);
    }
}
```
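The 1024-sample pre-trigger setting mentioned above divides the capture buffer into a part before the trigger and a part starting at the trigger. The C sketch below illustrates that split. It assumes that the trigger position set in the tool equals the number of samples stored before the trigger sample, and that an 8K buffer holds 8192 samples. The names **ila\_split\_t** and **ila\_split()** are hypothetical.

```c
#include <assert.h>

/* Hypothetical description of how a capture buffer is divided by the trigger. */
typedef struct {
    unsigned pre;   /* samples stored before the trigger sample */
    unsigned post;  /* samples from the trigger sample to the end of the buffer */
} ila_split_t;

/* Split a buffer of `depth` samples at trigger position `trigger_pos`. */
ila_split_t ila_split(unsigned depth, unsigned trigger_pos)
{
    ila_split_t s;
    assert(trigger_pos < depth);  /* the trigger sample must lie inside the buffer */
    s.pre = trigger_pos;
    s.post = depth - trigger_pos;
    return s;
}
```

With these assumptions, ila\_split(8192, 1024) leaves 7168 samples from the trigger onwards on the TE0720-03-2IF and TE0720-03-1QF, while ila\_split(2048, 1024) leaves only 1024 on the TE0720-03-14S-1C.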





Figure 13: Release demo t01\_s. Standalone demo supports measurements of the chip temperature.

The standalone demos support measurement of the chip temperature in a new dashboard connected to the XADC system monitor.



#### Installation and use of Release Evaluation Package - Linux examples

In case of Linux target:

- (1) In Win 7 or Win 10 (32bit or 64bit PC), unzip the basic evaluation package TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL.zip to a directory of your choice. We will use: c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\
- (2) Select one of the examples (t01\_l ... t21\_l) and copy the content of its sd\_card directory to the SD card. Example: Copy the content (including the subdirectory and its content) of the directory c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_l\Release\sd\_card\ to the root of the SD card.
- (3) Connect a Mini USB cable from the J7 connector to the PC. It will serve as the ARM terminal and the JTAG line.
- (4) Connect a Micro USB cable from the USBUART pmod module in the J5 connector to the PC. It will serve as the MicroBlaze terminal.
- (5) Power ON the carrier board and open a putty (or similar) terminal client for both USB serial lines. Set the serial communication to: [speed 115200, data bits 8, stop bits 1, parity none and flow control None] in both cases.
- (6) Insert the SD card into the TE0706-02 or TE0703-05 carrier board.
- (7) Reset the carrier board.
  - The Linux system will start. See Figure 14.

  - Log in. Type the user name:

    root

    Type the password:

    root

  - Mount the SD card to the directory /mnt (see Figure 15) by typing:

    mount /dev/mmcblk0p1 /mnt

  - Change directory to /mnt (see Figure 15):

    cd /mnt

  - Compile the firmware for the PicoBlaze6 with the EdkDSP C compiler (see Figure 15):

    ./edkdsp/tools/cc\_fx.sh ./edkdsp/a

    or

    ./edkdsp/tools/cc\_fx.sh ./edkdsp/b

    or

    ./edkdsp/tools/cc\_fx.sh ./edkdsp/c

  - The PicoBlaze6 C source files f0.c, f1.c, f2.c and f3.c from the directory ./edkdsp/a are compiled by the EdkDSP C compiler to the firmware files (see Figure 15):

    ./f0.dec ./f1.dec ./f2.dec ./f3.dec

  - The ARM terminal will present the output from the EdkDSP C compiler.
  - The MicroBlaze terminal is not active at this point, because the EdkDSP is not programmed yet.
  - Start the Linux user space application by typing:

    ./te01\_l.elf

  - The ARM terminal will present the output from the te01\_l.elf example. See Figure 16.
  - The MicroBlaze terminal will present the output from the 8xSIMD EdkDSP IP working with the new firmware programs, as recompiled by the EdkDSP C compiler from the C source files f0.c, f1.c, f2.c and f3.c in the directory ./edkdsp/a.





The output from the 8xSIMD EdkDSP is identical to the standalone output. See Figure 8.

- (8) On the PC, open the Vivado Lab tool. See *Figure 9*.
  - Open the Vivado Lab tools 2017.4.1 Hardware Manager.
  - Press the Auto Connect icon in the Hardware window.
  - Open the description of the debug nets by specifying the probes file (see Figure 10):
    c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_l\Release\debug\_nets.ltx
  - Set the ILA trigger conditions and observe the process of computation in the 8xSIMD EdkDSP IP. See *Figure 11*, *Figure 12*.
- (9) Close Vivado Lab tool project.
- (10) Remove SD card and reprogram it in PC to test another example.
- (11) Go to step (6).

```
/etc/rcS.d/S99run-postinsts
INIT: Entering runlevel: 5
Configuring network interfaces... udhcpc (v1.24.1) started
Sending discover...
Sending discover...
Sending discover...
No lease, forking to background
Starting Dropbear SSH server: Generating key, this may take a while.
Public key portion is:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCTU2PAzvgDZOL01yMRynDqBUmi/I8x
2TQ2a8yQNJuxD0c63gjLRx9lowQiM0exSLCKS9qrL0qRCMXuGk0xZUq4zbF9rdstwMXH
3NtbMEcOcuQTt+hlruHWnOh2dfML8V5/8eQq6ifo4O6+fL5mQCGWICRBIZrh0/20QqIA
smPDorHXOGwbP5leAKJ2Q8bhWM90PUKAs0HVMR3HRZ4kJV1Fo6xARNi1CvRX7tx6I4YB
zUFQCzvErqwftKUlmpAaE/M3y+lQuaMkE/coAuzB0qNuCkIJQYb5eNDyVN1mHh+eI0kP
SPSmDx/yTsw123FbYhsu3kEQAMZSsbKYfh9Ex7yJ root@petalinux
Fingerprint: md5 c3:ab:f7:b6:57:f4:02:4f:80:bb:5a:28:5e:2a:a9:63
dropbear.
Init Start
Init End
Starting syslogd/klogd: done
Starting tcf-agent: OK
PetaLinux 2017.4 petalinux /dev/ttyPS0
petalinux login:
```

Figure 14: Release demo t01\_l. Linux start.



```
SPSmDx/yTswl23FbYhsu3kEQAMZSsbKYfh9Ex7yJ root@petalinux
Fingerprint: md5 c3:ab:f7:b6:57:f4:02:4f:80:bb:5a:28:5e:2a:a9:63
dropbear.
Init Start
Init End
Starting syslogd/klogd: done
Starting tcf-agent: OK
PetaLinux 2017.4 petalinux /dev/ttyPS0
petalinux login: root
Password:
root@petalinux:~# mount /dev/mmcblk0p1 /mnt
root@petalinux:~# cd /mnt
root@petalinux:/mnt# ls
BOOT.bin  edkdsp  f1.dec  f3.dec    te01_l.elf
_sds      f0.dec  f2.dec  image.ub
root@petalinux:/mnt# ./edkdsp/tools/cc_fx.sh ./edkdsp/a
EDKDSPCC : f0.c ...
EDKDSPPSM: f0.psm ...
EDKDSPCC : f1.c ...
EDKDSPPSM: f1.psm ...
EDKDSPCC : f2.c ...
EDKDSPPSM: f2.psm ...
EDKDSPCC : f3.c ...
EDKDSPPSM: f3.psm ...
root@petalinux:/mnt#
```

Figure 15: Release demo t01\_l; Login, Compilation of firmware in the EdkDSP C Compiler.



```
Init Start
Init End
Starting syslogd/klogd: done
Starting tcf-agent: OK
PetaLinux 2017.4 petalinux /dev/ttyPS0
petalinux login: root
Password:
root@petalinux:~# mount /dev/mmcblk0p1 /mnt
root@petalinux:~# cd /mnt
root@petalinux:/mnt# ls
BOOT.bin  edkdsp  f1.dec  f3.dec    te01_l.elf
_sds      f0.dec  f2.dec  image.ub
root@petalinux:/mnt# ./edkdsp/tools/cc_fx.sh ./edkdsp/a
EDKDSPCC : f0.c ...
EDKDSPPSM: f0.psm ...
EDKDSPCC : f1.c ...
EDKDSPPSM: f1.psm ...
EDKDSPCC : f2.c ...
EDKDSPPSM: f2.psm ...
EDKDSPCC : f3.c ...
EDKDSPPSM: f3.psm ...
root@petalinux:/mnt# ./te01_l.elf
/dev/mem opened.
Memory mapped at address 0xb6f1c000.
Memory mapped at address 0xae395000.
ARMCPU0: Write firmware ...
ARMCPU0: Open input file f0.dec ... OK
ARMCPU0: Open input file f1.dec ... OK
ARMCPU0: Open input file f2.dec ... OK
ARMCPU0: Open input file f3.dec ... OK
ARMCPU0: Close input file f0.dec ... OK
ARMCPU0: Close input file f1.dec ... OK
ARMCPU0: Close input file f2.dec ... OK
ARMCPU0: Close input file f3.dec ... OK
ARMCPU0: Write firmware Done.
Reset for 1 sec. ... Done.
ARMCPU0: MB0 reset removed, ARM waiting ... Done.
ARMCPU0: MB0 indicates - running ...
Number of CPU cycles running application in software: 164480
Number of CPU cycles running application in hardware: 48380
Speed up: 3.39975
TEST PASSED
root@petalinux:/mnt#
```

Figure 16: Release demo t01\_l; Program and start 8xSIMD EdkDSP demo.
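The "Speed up" value printed by the demos is consistent with the ratio of the two reported cycle counts. The following C sketch reproduces it; the function name **speed\_up()** is hypothetical, and the formula is inferred from the printed numbers rather than taken from the demo source code.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio of CPU cycles spent in the software run to CPU cycles spent in the
 * hardware-accelerated run, as reported on the ARM terminal. */
double speed_up(uint64_t sw_cycles, uint64_t hw_cycles)
{
    assert(hw_cycles > 0);  /* avoid division by zero */
    return (double)sw_cycles / (double)hw_cycles;
}
```

For the Linux run in Figure 16, speed\_up(164480, 48380) is approximately 3.39975. For the standalone run in Figure 8, speed\_up(164000, 24760) is approximately 6.62359.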



# 9. Installation and Use of Debug Evaluation Package

The debug evaluation package is offered free of charge to the ECSEL PRODUCTIVE 4.0 project partners [8] on written request to UTIA. See the license conditions listed in the next sections of this report.

The debug evaluation package supports:

- Compilation from C source code and debugging for the MicroBlaze processor, for Linux and standalone targets.
- Creation and release of SD cards with newly compiled MicroBlaze SW and newly compiled PicoBlaze6 firmware, for Linux and standalone targets.
- In-circuit Logic Analyser (ILA) JTAG-based inspection/observation/debug of the 8xSIMD EdkDSP IP.
  - In case of TE0720-03-2IF, ILA captures 8K samples of debug probes at 120 MHz.
  - In case of TE0720-03-1QF, ILA captures 8K samples of debug probes at 100 MHz.
  - In case of TE0720-03-14S-1C, ILA captures 2K samples of debug probes at 100 MHz.
- Embedded compilation of C/ASM source code to firmware for the reprogrammable PicoBlaze6 finite state machine (FSM). Inside the 8xSIMD EdkDSP IP core, this FSM schedules the floating-point computation sequences performed in the 8xSIMD data flow unit (DFU). This embedded compilation is supported for the Linux examples.
- The standalone examples also support ILA and can additionally display the on-chip temperature via JTAG.
- The extended evaluation package requires the Xilinx SDK 2017.4.1 tools (the download is free). The SDK serves for compilation of the MicroBlaze code, download of the compiled MicroBlaze code via JTAG, and debugging of this code in parallel with the ILA inspection/observation/debug of the EdkDSP IP core.
- The In-circuit Logic Analyser (ILA) JTAG-based inspection/observation/debug can be performed from the free Xilinx Vivado Lab 2017.4.1 tool installed on a Win 7 (64bit) or Win 10 (64bit) PC.
- The Linux target examples support 1 Gbit Ethernet, SSH telnet and file system management tools such as Total Commander for Ethernet-based access from a PC to the files on the SD card, including editing of these files from the user PC.

The extended evaluation package provides 21 precompiled designs for the Linux target and 19 precompiled designs for the standalone target as described in *Table 12*.



## Installation and use of debug evaluation package – standalone examples

In case of standalone target:

(1) In Win 7 or Win 10 (64 bit PC), unzip the debug evaluation package: TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_INSTALL.zip to directory of your choice. We will use:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_INSTALL\

In Xilinx SDK 2017.4.1 create a new workspace in the directory of your choice. We will use: c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_SDK\_Workspace\



Figure 17: Create new SDK 2017.4.1 workspace.





Figure 18: Import the extended debug evaluation package projects into the SDK Workspace.

Import (with copy) all SDK projects from:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_INSTALL\SDSoC\_PFM\2if\SDK\_Workspace\

to the new SDK workspace:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_SDK\_Workspace\

Both MicroBlaze projects will be compiled automatically by the SDK for the debug configuration.

- (2) Select one of the examples (t01\_s ... t21\_s) and copy the content of its sd\_card directory to the SD card. Example: Copy BOOT.bin to the root of the SD card from:
  c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_INSTALL\SDSoC\_PFM\2if\SD\_debug\te01\_s\Release\sd\_card\BOOT.bin
- (3) Connect a Mini USB cable from the J7 connector to the PC. It will serve as the ARM terminal and the JTAG line.
- (4) Connect a Micro USB cable from the USBUART pmod module in the J5 connector to the PC. It will serve as the MicroBlaze terminal.
- (5) Power ON the carrier board and open a putty (or similar) terminal client for both USB serial lines. Set the serial communication to [speed 115200, data bits 8, stop bits 1, parity none and flow control None] in both cases.





Figure 19: SDK compiles MicroBlaze SW projects for the standalone debug target.

- (6) Insert SD card to the TE0706-02 or TE0703-05 carrier board.
- (7) Reset the carrier board.
  - The standalone system will start.
  - The ARM terminal will present output from the **t01\_s** example.
  - The ARM application is waiting for the MicroBlaze.



```
COM65-PuTTY

Xilinx First Stage Boot Loader (TE modified)
Release 2017.4 May 7 2018-08:27:58

Device IDCODE -> 23727093
Revision -> 2
Device -> 7 (7z020)

SoM: TE0720-03-2I F SC REV:05
MAC: D8 80 39 DE 68 E0
ARMCPU0: place 0xb8000000 at start of MB0 vectors

ARMCPU0: MB0 reset removed, ARM waiting ...
```

Figure 20: Debug demo t01\_s; Execution of the ./t01\_s.elf example from the SD card.

- The Xilinx SDK project
  c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_SDK\_Workspace\edkdsp\_fp12\_1x8\_s
  includes the PicoBlaze6 firmware header files fill\_f0\_program\_store.h, fill\_f1\_program\_store.h, fill\_f2\_program\_store.h and fill\_f3\_program\_store.h. Note: These files can be recompiled from the C source code by the EdkDSP C compiler in the Linux target session, as described in the next section.
- In the Xilinx SDK workspace, compile the **edkdsp\_fp12\_1x8\_s** project with the existing (or new, recompiled) PicoBlaze6 firmware headers **fill\_f0\_program\_store.h**, **fill\_f1\_program\_store.h**, **fill\_f2\_program\_store.h** and **fill\_f3\_program\_store.h**.
- In the Xilinx SDK workspace, select Debug of the MicroBlaze project **edkdsp\_fp12\_1x8\_s**. In the Debug Configurations, select "No reset", unselect "Run ps7\_init", unselect "Run ps7\_post\_config" and click "Apply".



Figure 21: Debug demo t01\_s; Open project edkdsp\_fp12\_1x8\_s for debug.






Figure 22: Debug demo t01\_s; Start the free-run from the debugger.

- In the SDK debugger, step through the MicroBlaze source code, inspect the content of variables, set breakpoints and finally select the free run of the MicroBlaze code.
- At this stage, the ARM terminal will present the output from the ARM t01\_s.elf example. See Figure 23.

```
ARMCPU0: MB0 reset removed, ARM waiting ...
ARMCPU0: MB0 indicates - running ...
Number of CPU cycles running application in software: 163178
Number of CPU cycles running application in hardware: 26692
Speed up: 6.11337
Note: Speed up is meaningful for real hardware execution only, not for emulation.
TEST PASSED
```

Figure 23: Debug demo t01\_s. ARM started the EdkDSP and runs the SDSoC accelerator demo.



The MicroBlaze terminal will present output from the debugged MicroBlaze and the 8xSIMD EdkDSP IP core. See *Figure 24*.

```
COM57 - PuTTY
MB0 : Start of MB ... Done.
MB0 : Read firmware ... Done.
Initialize TmrCtr for axi timer 0...
MB0 : (EdkDSP 8xSIMD) Write firmware ...
MB0 : (EdkDSP 8xSIMD) Capabilities1 = 13ffff
MB0 : (HW FP unit   ) Far-end signal ...
MB0 : (EdkDSP 8xSIMD) FIR room response ... 1117 MFLOPs
MB0 : (HW FP unit   ) Add near-end signal ...
MB0 : (EdkDSP 8xSIMD) LMS Identification ... 728 MFLOPs
MB0 : (HW FP unit   ) LMS Identification ... 3 MFLOPs
MB0 : (EdkDSP 8xSIMD) OK
MB0 : (EdkDSP 8xSIMD) Write firmware ...
MB0 : (EdkDSP 8xSIMD) Capabilities1 = 13ffff
MB0 : (EdkDSP 8xSIMD) VZ2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VB2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VZ2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VA2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VADD 'worker1'
MB0 : (EdkDSP 8xSIMD) VADD BZ2A 'worker1' ..
MB0 : (EdkDSP 8xSIMD) VADD AZ2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VSUB 'worker1'
MB0 : (EdkDSP 8xSIMD) VSUB BZ2A 'worker1' ...
MB0 : (EdkDSP 8xSIMD) VSUB AZ2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VMULT 'worker1'
MB0 : (EdkDSP 8xSIMD) VMULT BZ2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VMULT AZ2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VPROD 'worker1'
MB0 : (EdkDSP 8xSIMD) VMAC 'worker1'
MB0 : (EdkDSP 8xSIMD) VMSUBAC 'worker1'
MB0 : (EdkDSP 8xSIMD) VPROD S8 'worker1' ... OK
MB0 : (EdkDSP 8xSIMD) VDIV 'worker1'
```

Figure 24: Debug demo t01\_s; MicroBlaze project output (Compiled for Debug).

(8) On the PC, open the Vivado Lab tool. See Figure 9.

- Open Hardware Manager.
- Press the Auto Connect icon in the Hardware window to connect to the board via the JTAG line.
- Open the description of the debug nets by specifying the probes file:
  c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_INSTALL\SDSoC\_PFM\2if\SD\_debug\te01\_s\Release\debug\_nets.ltx
- Set the ILA trigger conditions and observe the process of computation in the 8xSIMD EdkDSP IP. See *Figure 10*, *Figure 11*, *Figure 12*.
- Open a new perspective and observe the chip temperature. See Figure 13.
- Close the Vivado Lab tool project.



- (9) In SDK debugger, stop MicroBlaze processor and close the debug session.
- (10) Remove SD card and reprogram it in the PC to test another example.
- (11) Go to step (6).

## Installation and use of Debug Evaluation Package – Linux examples

(1) In Win 7 or Win 10 (64bit PC), unzip the debug evaluation package TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_INSTALL.zip to a directory of your choice. We will use:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_INSTALL\

Open a new Xilinx SDK 2017.4.1 workspace in the directory:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_SDK\_Workspace\

Import (with copy) all SDK projects from:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_INSTALL\SDSoC\_PFM\2if\SDK\_Workspace\

to the new SDK workspace.

(2) Select one of the examples (t01\_l ... t21\_l) and copy the content of its sd\_card directory to the SD card. Example: Copy the content of the directory

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_INSTALL\SDSoC\_PFM\2if\SD\_debug\te01\_l\Release\sd\_card\

to the root of the SD card.

- (3) Connect a USB cable from the J7 connector to the PC. It will serve as the ARM terminal and the JTAG line.
- (4) Connect a USB cable from the USBUART pmod module in the J5 connector to the PC. It will serve as the MicroBlaze terminal.
- (5) Power ON the carrier board and open a putty (or similar) terminal client for both USB serial lines. Set the serial communication to [speed 115200, data bits 8, stop bits 1, parity none and flow control None] in both cases.


- (6) Insert the SD card into the TE0706-02 or TE0703-05 carrier board.
- (7) Reset the carrier board.
  - The Linux system will start. See Figure 25.
  - Type the Linux user name and password:

    root

    root



```
COM65 - PuTTY
Password:
root@petalinux:~# mount /dev/mmcblk0p1 /mnt
root@petalinux:~# cd /mnt
root@petalinux:/mnt# ls
BOOT.BIN    _sds    f0.dec  f2.dec  image.ub
README.txt  edkdsp  f1.dec  f3.dec  te01_l.elf
root@petalinux:/mnt# ./edkdsp/tools/cc_fx.sh ./edkdsp/a
EDKDSPCC : f0.c ...
EDKDSPPSM: f0.psm ...
EDKDSPCC : f1.c ...
EDKDSPPSM: f1.psm ...
EDKDSPCC : f2.c ...
EDKDSPPSM: f2.psm ...
EDKDSPCC : f3.c ...
EDKDSPPSM: f3.psm ...
root@petalinux:/mnt# ./edkdsp/tools//cs_fx.sh ./edkdsp/a
EDKDSPCC : f0.c ...
EDKDSPASM: f0.psm ...
Generated M function file in the M file ././fill_f0_program_store.m
Generated C header file in the H file ./fill_f0_program_store.h
EDKDSPCC : f1.c ...
EDKDSPASM: f1.psm ...
Generated M function file in the M file ././fill_f1_program_store.m
Generated C header file in the H file ./fill_f1_program_store.h
EDKDSPCC : f2.c ...
EDKDSPASM: f2.psm ...
Generated M function file in the M file ././fill_f2_program_store.m
Generated C header file in the H file ./fill_f2_program_store.h
EDKDSPCC : f3.c ...
EDKDSPASM: f3.psm ...
Generated M function file in the M file ././fill_f3_program_store.m
Generated C header file in the H file ./fill_f3_program_store.h
root@petalinux:/mnt# ./te01_l.elf
/dev/mem opened.
Memory mapped at address 0xb6f1c000.
Memory mapped at address 0xae395000.
ARMCPU0: Write firmware ...
ARMCPU0: Open input file f0.dec ... OK
ARMCPU0: Open input file f1.dec ... OK
ARMCPU0: Open input file f2.dec ... OK
ARMCPU0: Open input file f3.dec ... OK
ARMCPU0: Close input file f0.dec ... OK
ARMCPU0: Close input file f1.dec ... OK
ARMCPU0: Close input file f2.dec ... OK
ARMCPU0: Close input file f3.dec ... OK
ARMCPU0: Write firmware Done.
ARMCPU0: place 0x28100000 at start of MB0 vectors
Reset for 1 sec. ... Done.
ARMCPU0: MB0 reset removed, ARM waiting ...
```

Figure 25: Compiled EdkDSP firmware. Started debug demo - Linux target t01\_l.




- Mount the SD card to the directory /mnt (see Figure 25) by typing:
  - mount /dev/mmcblk0p1 /mnt
- Change the directory to /mnt by typing:
  - cd /mnt
- Compile the firmware for the PicoBlaze6 with the EdkDSP C compiler (see Figure 25):
  - ./edkdsp/tools/cc\_fx.sh ./edkdsp/a
- The PicoBlaze6 C source code files from the directory ./edkdsp/a
  - ./edkdsp/a/f0.c ./edkdsp/a/f1.c ./edkdsp/a/f2.c ./edkdsp/a/f3.c are compiled by the EdkDSP C compiler to the firmware files:
  - ./f0.dec ./f1.dec ./f2.dec ./f3.dec
- Optionally, you can also compile the PicoBlaze6 firmware into header files for the standalone target (see *Figure 25*):
  - ./edkdsp/tools/cs\_fx.sh ./edkdsp/a

The generated header files with the PicoBlaze6 firmware for the standalone EdkDSP IP target are stored in the SD card root directory:

- ./fill\_f0\_program\_store.h ./fill\_f1\_program\_store.h
- ./fill\_f2\_program\_store.h ./fill\_f3\_program\_store.h

These headers serve for the standalone MicroBlaze projects. Headers are compiled directly into the debugged MicroBlaze standalone application as described above.
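As a quick check after running the two compile scripts, the expected output files can be listed and compared with the content of the SD card root. The following is a minimal Python sketch, assuming it runs on a PC with the SD card mounted (or on any copy of the card content). The file names come from the listings above; the helper names `expected_outputs` and `missing_outputs` are hypothetical and not part of the EdkDSP tools.

```python
from pathlib import Path

# Firmware program names used in this section (f0.c ... f3.c).
PROGRAMS = ("f0", "f1", "f2", "f3")


def expected_outputs(with_headers=False):
    """Return the file names the compile scripts are expected to produce."""
    # Firmware files written by cc_fx.sh.
    names = [f"{p}.dec" for p in PROGRAMS]
    if with_headers:
        # Header files written by the optional cs_fx.sh step.
        names += [f"fill_{p}_program_store.h" for p in PROGRAMS]
    return names


def missing_outputs(sd_root, with_headers=False):
    """List the expected files that are absent from the SD card root."""
    root = Path(sd_root)
    return [n for n in expected_outputs(with_headers)
            if not (root / n).is_file()]
```

An empty list returned by `missing_outputs` suggests that all four firmware programs were compiled; it does not verify the content of the files.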

- Execute the ARM Linux application. See Figure 25.
- The ARM terminal will present output from the EdkDSP C compiler.
- The MicroBlaze terminal will present output from the 8xSIMD EdkDSP IP.
- Start the Linux application by typing ./te01\_l.elf (as shown in Figure 25).
- The ARM terminal will present output from the **te01\_l.elf** example. At this stage, the ARM application is waiting for the MicroBlaze.
- In the Xilinx SDK environment on the PC, select the debug project (see Figure 26):
  - c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_SDK\_Workspace\edkdsp\_fp12\_1x8\_l



Figure 26: Select MicroBlaze project edkdsp\_fp12\_1x8\_l for debug.



- In the SDK debugger, step through the MicroBlaze source code, inspect the content of variables, set breakpoints, etc. See *Figure 27*.
- In the SDK debugger, select free run of the MicroBlaze code. See Figure 27.
- The MicroBlaze terminal will present output from the 8xSIMD EdkDSP IP working with the new firmware programs re-compiled by the EdkDSP C compiler from the C source code files ./edkdsp/a/f0.c, ./edkdsp/a/f1.c, ./edkdsp/a/f2.c and ./edkdsp/a/f3.c. The terminal output is identical to *Figure 24*.
- The ARM terminal will continue to present output from the te01\_l.elf example. See Figure 28.
- In the ARM terminal, type **ls -lr** to see a listing of the files compiled by the EdkDSP C compiler. See Figure 28.

The compiled header files fill\_f0\_program\_store.h, fill\_f1\_program\_store.h, fill\_f2\_program\_store.h and fill\_f3\_program\_store.h can be used as new source code for the standalone MicroBlaze project

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_SDK\_Workspace\edkdsp\_fp12\_1x8\_s



Figure 27: Select free run of MicroBlaze project edkdsp\_fp12\_1x8\_l.




- (8) On the PC, open the Vivado Lab tool hardware manager. See Figure 9.
  - Press the Auto Connect icon in the Hardware window to connect to the board via the JTAG line.
  - Specify the probes file with the description of the debug nets:
    - c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_INSTALL\SDSoC\_PFM\2if\SD\_debug\te01\_s\Release\debug\_nets.ltx
  - Set the ILA trigger conditions. See Figure 10, Figure 11 and Figure 12.
  - Close the Vivado Lab tool project.
- (9) In the SDK debugger, stop the MicroBlaze processor and close the debug session.
- (10) Exit from Linux by typing on the ARM terminal: exit
- (11) Remove the SD card and reprogram it on the PC to test another example.
- (12) Go to step (6).

```
COM65 - PuTTY
ARMCPU0: Close input file f2.dec ... OK
ARMCPU0: Close input file f3.dec ... OK
ARMCPU0: Write firmware Done.
ARMCPU0: place 0x28100000 at start of MB0 vectors
Reset for 1 sec. ... Done.
ARMCPU0: MB0 reset removed, ARM waiting ... Done.
ARMCPU0: MB0 indicates - running ...
Number of CPU cycles running application in software: 164782
Number of CPU cycles running application in hardware: 48030
Speed up: 3.43081
TEST PASSED
root@petalinux:/mnt# ls -lr
total 13504
-rwxrwx---    1 root     disk         68656 May  7  2018 te01_l.elf
-rwxrwx---    1 root     disk            32 Apr 16 10:21 sds_trace_data.dat
-rwxrwx---    1 root     disk       9986400 Apr 16 08:16 image.ub
-rwxrwx---    1 root     disk         11547 Apr 16 10:09 fill_f3_program_store.m
-rwxrwx---    1 root     disk         11240 Apr 16 10:09 fill_f3_program_store.h
-rwxrwx---    1 root     disk         11895 Apr 16 10:09 fill_f2_program_store.m
-rwxrwx---    1 root     disk         11588 Apr 16 10:09 fill_f2_program_store.h
-rwxrwx---    1 root     disk         11482 Apr 16 10:09 fill_f1_program_store.m
-rwxrwx---    1 root     disk         11175 Apr 16 10:09 fill_f1_program_store.h
-rwxrwx---    1 root     disk         11482 Apr 16 10:09 fill_f0_program_store.m
-rwxrwx---    1 root     disk         11175 Apr 16 10:09 fill_f0_program_store.h
-rwxrwx---    1 root     disk         10475 Apr 16 10:09 f3.psm
-rwxrwx---    1 root     disk         20689 Apr 16 10:09 f3.log
-rwxrwx---    1 root     disk          2221 Apr 16 10:09 f3.dec
-rwxrwx---    1 root     disk         11867 Apr 16 10:09 f2.psm
-rwxrwx---    1 root     disk         23557 Apr 16 10:09 f2.log
-rwxrwx---    1 root     disk          2861 Apr 16 10:09 f2.dec
-rwxrwx---    1 root     disk         10102 Apr 16 10:09 f1.psm
-rwxrwx---    1 root     disk         19398 Apr 16 10:09 f1.log
-rwxrwx---    1 root     disk          2076 Apr 16 10:09 f1.dec
-rwxrwx---    1 root     disk         10102 Apr 16 10:09 f0.psm
-rwxrwx---    1 root     disk         19398 Apr 16 10:09 f0.log
-rwxrwx---    1 root     disk          2076 Apr 16 10:09 f0.dec
drwxrwx---    6 root     disk         32768 May 12  2018 edkdsp
drwxrwx---    2 root     disk         32768 May 12  2018 _sds
-rwxrwx---    1 root     disk           186 May     2018 README.txt
-rwxrwx---    1 root     disk       2942248 May     2018 BOOT.BIN
root@petalinux:/mnt#
```

Figure 28: Output from ARM MicroBlaze for t01\_l. Compiled EdkDSP firmware.
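The speed-up line in Figure 28 is consistent with the ratio of the two reported cycle counts. The short Python sketch below reproduces that arithmetic; the function name `speed_up` is hypothetical, and the assumption that the demo computes the value as software cycles divided by hardware cycles is inferred from the printed numbers, not taken from the demo source code.

```python
def speed_up(sw_cycles, hw_cycles):
    """Ratio of software to hardware cycle counts (above 1 means HW is faster)."""
    if hw_cycles <= 0:
        raise ValueError("hw_cycles must be positive")
    return sw_cycles / hw_cycles


# Cycle counts copied from the Figure 28 transcript.
ratio = speed_up(164782, 48030)
print(f"Speed up: {ratio:.5f}")  # prints "Speed up: 3.43081"
```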



## Updating of the release SD card images for new standalone-release-target

Modified PicoBlaze6 C source code can be compiled to firmware headers by the embedded EdkDSP C compiler (Linux target). The resulting headers can be included in the SDK MicroBlaze standalone release target project. See *Figure 19*. The standalone-release-target SD card image is then updated by recompiling the (possibly modified) MicroBlaze C source code in the SDK project with the updated PicoBlaze6 firmware header files included. See *Figure 29*.



Figure 29: Create BOOT.bin for the t01\_s demo.

See the content of the directory:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_s\Release\uboot\

The new **BOOT.bin** image can be created from these five files:

t01\_s.bif, zynq\_fsbl.elf, zynq\_wrapper.bit.elf, t01\_s.elf, edkdsp\_fp12\_1x8\_s.elf

Replace the old edkdsp\_fp12\_1x8\_s.elf file with the new file recompiled in the SDK (with the new PicoBlaze6 firmware headers) from the SDK project:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_SDK\_Workspace\edkdsp\_fp12\_1x8\_s

Use the BOOT.bin generation utility (in the SDK workspace: Xilinx Tools -> Create Boot Image) to create the new BOOT.bin file (see Figure 29):

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_s\Release\uboot\BOOT.bin

Copy this new BOOT.bin file to:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_s\Release\sd\_card\BOOT.bin

The content of the standalone-release-target SD card is updated with new MicroBlaze and PicoBlaze6 firmware.
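The two file copies in this procedure can be scripted on the PC, with the boot image generation in the SDK remaining a manual step in between. The following Python sketch is only an illustration under that assumption: the function names `stage_microblaze_elf` and `publish_boot_image` are hypothetical, and the directories are passed in as parameters instead of hard-coding the installation paths listed above.

```python
import shutil
from pathlib import Path

ELF_NAME = "edkdsp_fp12_1x8_s.elf"  # MicroBlaze ELF name from the file list above
BOOT_NAME = "BOOT.bin"


def stage_microblaze_elf(sdk_build_elf, uboot_dir):
    """Replace the old MicroBlaze ELF in the uboot directory (run this first)."""
    target = Path(uboot_dir) / ELF_NAME
    shutil.copy2(sdk_build_elf, target)
    return target


def publish_boot_image(uboot_dir, sd_card_dir):
    """Copy the regenerated BOOT.bin to the sd_card directory (run this last)."""
    source = Path(uboot_dir) / BOOT_NAME
    if not source.is_file():
        # The image is created manually with Xilinx Tools -> Create Boot Image.
        raise FileNotFoundError(f"{source} not found; create the boot image first")
    target = Path(sd_card_dir) / BOOT_NAME
    shutil.copy2(source, target)
    return target
```

Splitting the procedure into two functions keeps the manual SDK step between them explicit: `publish_boot_image` refuses to run if no BOOT.bin exists in the uboot directory, but it cannot tell whether an existing image is stale.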




## Updating of the release SD card images for new Linux-release-target

The Linux-release-target SD card image can be updated by re-compilation of the (possibly modified) C source code for the MicroBlaze in the SDK project. See *Figure 30*.



Figure 30: Create BOOT.bin for the t01\_l demo.

Use the content of the directory:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_l\Release\uboot\

The new BOOT.bin image can be created from these files:

te\_l.bif, zynq\_fsbl.elf, zynq\_wrapper.bit.elf, u-boot.elf, edkdsp\_fp12\_1x8\_l.elf

Replace:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_l\Release\uboot\edkdsp\_fp12\_1x8\_l.elf

with the new file recompiled in the SDK project:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Debug\_SDK\_Workspace\edkdsp\_fp12\_1x8\_l

Use the BOOT.bin generation utility of the SDK to create the new BOOT.bin file:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_l\Release\uboot\BOOT.bin

Copy this new **BOOT.bin** file to:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_l\Release\sd\_card\BOOT.bin




Copy the modified **f0.c**, **f1.c**, **f2.c** and **f3.c** to the directory:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_l\Release\sd\_card\edkdsp\a\

Copy the compiled **f0.dec**, **f1.dec**, **f2.dec** and **f3.dec** to the directory:

c:\TS74\TE0720\_EdkDSP\_2if\_te706\_ila8k\_Release\_INSTALL\SDSoC\_PFM\2if\SD\_release\te01\_l\Release\sd\_card\

The content of the Linux-release-target SD card is updated with new MicroBlaze and PicoBlaze6 firmware and stored in the PC.
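The last two copy steps can also be scripted on the PC. The Python sketch below is a hypothetical helper, assuming that the modified sources and the compiled firmware files have been collected in one working directory (`work_dir`); that layout and the function name `copy_firmware` are illustrative assumptions, while the target sub-directories follow the paths given above.

```python
import shutil
from pathlib import Path

# Firmware program names used in this section (f0 ... f3).
PROGRAMS = ("f0", "f1", "f2", "f3")


def copy_firmware(work_dir, sd_card_dir):
    """Copy f0..f3 sources to sd_card/edkdsp/a and .dec files to sd_card."""
    work, sd_card = Path(work_dir), Path(sd_card_dir)
    source_dir = sd_card / "edkdsp" / "a"
    source_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in PROGRAMS:
        # Modified PicoBlaze6 C source code.
        copied.append(shutil.copy2(work / f"{name}.c", source_dir / f"{name}.c"))
        # Firmware compiled by the EdkDSP C compiler.
        copied.append(shutil.copy2(work / f"{name}.dec", sd_card / f"{name}.dec"))
    return [Path(path).name for path in copied]
```

The function raises `FileNotFoundError` if any of the eight input files is missing, so a partially compiled set of firmware programs is reported at the first missing file; files copied before that point remain in place.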



## 10. References

[1] **TE0720-03-2IF**; Part: XC7Z020-2CLG484I; 1 GByte DDR; Industrial Grade (Tj = -40°C to +100°C) http://shop.trenz-electronic.de/en/TE0720-03-2IF-Xilinx-Zynq-module-XC7Z020-2CLG484I-ind.-temp.-range-1-Gbyte https://www.trenz-electronic.de/fileadmin/docs/Trenz\_Electronic/TE0720/REV03/Documents/TRM-TE0720-03.pdf https://www.trenz-electronic.de/fileadmin/docs/Trenz\_Electronic/Modules\_and\_Module\_Carriers/4x5/TE0720/REV03/Documents/SCH-TE0720-03-2IF.PDF

**TE0720-03-1QF**; Part: XA7Z020-1CLG484Q; 1 GByte DDR; Automotive Grade (Tj = -40°C to +125°C) https://shop.trenz-electronic.de/en/TE0720-03-1QF-Xilinx-Zynq-module-ind.-temp.-range-with-Automotive-XA7Z020-1CLG484Q https://www.trenz-electronic.de/fileadmin/docs/Trenz\_Electronic/Modules\_and\_Module\_Carriers/4x5/TE0720/REV03/Documents/SCH-TE0720-03-1QF.PDF

**TE0720-03-14S-1C**; Part: XC7Z014S-1CLG484C; 1 GByte DDR; Commercial Grade (Tj = 0°C to +85°C) https://shop.trenz-electronic.de/en/TE0720-03-14S-1C-SoC-Module-with-Xilinx-Zynq-Z-7014S-Single-core-1-GByte-DDR3 https://www.trenz-electronic.de/fileadmin/docs/Trenz\_Electronic/Modules\_and\_Module\_Carriers/4x5/TE0720/REV03/Documents/SCH-TE0720-03-14S-1C.PDF

[2] Heatsink for TE0720, spring-loaded embedded; https://shop.trenz-electronic.de/en/26922-Heatsink-for-TE0720-spring-loaded-embedded?c=38

[3] **TE0706-02** Carrier board for Trenz Electronic Modules with 4 x 5 cm Form factor https://shop.trenz-electronic.de/en/TE0706-02-TE0706-Carrierboard-for-Trenz-Electronic-Modules-with-4-x-5-cm-Form-factor?c=261 https://www.trenz-electronic.de/fileadmin/docs/Trenz\_Electronic/carrier\_boards/TE0706/REV02/documents/SCH-TE0706-02.PDF https://wiki.trenz-electronic.de/display/PD/TE0706+TRM

**TE0703-05** Carrier board for Trenz Electronic Modules with 4 x 5 cm Form factor https://shop.trenz-electronic.de/en/TE0703-05-TE0703-Carrier-board-for-Trenz-Electronic-modules-with-4-x-5-cm-form-factor?c=261 https://www.trenz-electronic.de/fileadmin/docs/Trenz\_Electronic/Modules\_and\_Module\_Carriers/4x5/4x5\_Carriers/TE0703/REV05/Documents/SCH-TE0703-05.PDF https://wiki.trenz-electronic.de/display/PD/TE0703+TRM

- [4] **Pmod USBUART**: Serial converter & interface. https://shop.trenz-electronic.de/en/24242-Pmod-USBUART-USB-to-UART-Interface?c=80
- [5] **XMOD FTDI JTAG Adapter** Xilinx compatible https://shop.trenz-electronic.de/en/TE0790-02-XMOD-FTDI-JTAG-Adapter-Xilinx-compatible
- [6] Vivado HLx Web Install Client 2017.4.1. https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools/2015-4.html
- [7] SDSoC 2017.4.1 Full Product Installations. https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/sdx-development-environments/sdsoc/2015-4.html


[8] PRODUCTIVE 4.0 Project www page in UTIA with pointers to evaluation packages for download http://sp.utia.cz/index.php?ids=projects/productive40



## 11. Base Release Evaluation Package

The base, release evaluation package can be downloaded from UTIA www pages [8] free of charge.

#### **Deliverables:**

The base, release evaluation package [8] includes evaluation bitstreams with single (8xSIMD) EdkDSP IP working in parallel with selected HW-accelerated SDSoC algorithms on the Trenz Electronic TE0720-03-2IF, TE0720-03-1QF and TE0720-03-14S-1C module [1] located on the Trenz Electronic TE0706-02 or TE0703-05 carrier [3] with PMOD USBUART adapter [4] and XMOD FTDI JTAG Adapter [5].

The evaluation package [8] includes bitstreams compiled with the evaluation version of the (8xSIMD) EdkDSP IP core. Bitstreams contain these IPs:

- bce\_fp12\_1x8\_0\_axiw\_v1\_10\_c: Evaluation version of the AXI-lite interface
- bce\_fp12\_1x8\_40: Evaluation version of the floating-point data path

The base, release evaluation version of the (8xSIMD) EdkDSP IP is compiled into bitstreams with a HW limit on the number of vector operations. The termination of the nonexclusive, non-transferable evaluation license of this evaluation IP core is reported in advance by the demonstrator on the PMOD USBUART terminal. The evaluation designs will run again after the reset (TE0706-02: Reset push button S2; TE0703-05: Reset push button S1).

The base evaluation package [8] includes these binary applications:

- edkdsppp.elf: EdkDSP C pre-processor binary for ARM PetaLinux running on the evaluation board.
- edkdspcc.elf: EdkDSP C compiler binary for ARM PetaLinux running on the evaluation board.
- edkdsppsm.elf: EdkDSP ASM compiler binary for ARM PetaLinux running on the evaluation board.

These binary applications have no time restriction. The user of the evaluation package has nonexclusive, non-transferable license from UTIA to use these utilities for compilation of the firmware for the Xilinx PicoBlaze6 processor inside of the 8xSIMD EdkDSP IP in precompiled designs. The source code of these compilers is owned by UTIA and it is not provided in the evaluation package.

The base, release evaluation package [8] includes demonstration firmware in C source code for the Xilinx PicoBlaze6 processor for the family of UTIA EdkDSP accelerators for the Trenz Electronic TE0720-03-2IF, TE0720-03-1QF and TE0720-03-14S-1C module [1] on Trenz Electronic TE0706-02 or TE0703-05 carrier board [3].

HW boards are not part of deliverables. HW can be ordered separately from [1] – [5].

Any and all legal disputes that may arise from or in connection with the use, intended use of or license for the software provided hereunder shall be exclusively resolved under the regional jurisdiction relevant for UTIA AV CR, v. v. i. and shall be governed by the law of the Czech Republic. See also the Disclaimer section.





## 12. Extended Debug Evaluation Package for PRODUCTIVE 4.0 partners

The extended, debug evaluation package includes **MicroBlaze and PicoBlaze6 C code and precompiled bitstreams of HW projects** for the Trenz Electronic TE0720-03-2IF, TE0720-03-1QF and TE0720-03-14S-1C module [1] located on the Trenz Electronic TE0706-02 or TE0703-05 carrier [3] with PMOD USBUART adapter [4] and XMOD FTDI JTAG Adapter [5] with the evaluation version of the (8xSIMD) EdkDSP IP. Partners of the ECSEL PRODUCTIVE 4.0 project [8] can order this extended package from UTIA AV CR, v.v.i., by email request for quotation to kadlec@utia.cas.cz.

UTIA AV CR, v.v.i., will send the PRODUCTIVE 4.0 project partner a quotation by email. After the customer confirms the quotation, UTIA AV CR, v.v.i., will send the customer this invoice:

The extended, debug evaluation package with MicroBlaze and PicoBlaze6 C code and precompiled bitstream of HW projects for the Trenz Electronic TE0720-03-2IF, TE0720-03-1QF and TE0720-03-14S-1C module [1] located on the Trenz Electronic TE0706-02 or TE0703-05 carrier [3] with PMOD USBUART adapter [4] and XMOD FTDI JTAG Adapter [5] with the evaluation version of the 8xSIMD EdkDSP IP for the partners in the ECSEL PRODUCTIVE 4.0 project

(Without VAT) 0,00 Eur

After receiving confirmation from the PRODUCTIVE 4.0 project partner that the zero invoice has been received, UTIA AV CR, v.v.i., will send, within 5 working days and by standard mail, a printed version of this application note together with a DVD containing the Deliverables described in this section.

#### **Deliverables:**

The extended, debug evaluation package for PRODUCTIVE 4.0 partners [8] includes MicroBlaze and PicoBlaze6 C code and precompiled bitstreams of HW projects. MicroBlaze and PicoBlaze6 SW projects can be modified and recompiled by the PRODUCTIVE 4.0 project partner.

The extended, debug evaluation version of the UTIA 8xSIMD EdkDSP accelerator IP is provided in precompiled bitstreams of HW projects with these IPs:

- bce\_fp12\_1x8\_0\_axiw\_v1\_10\_c: Evaluation version of the AXI-lite interface
- bce\_fp12\_1x8\_40: Evaluation version of the floating-point data path

The extended, debug evaluation version of the 8xSIMD EdkDSP IP is compiled into bitstreams with a HW limit on the number of vector operations. The termination of the nonexclusive, non-transferable evaluation license of this evaluation IP core is reported in advance by the demonstrator on the PMOD USBUART terminal. The evaluation designs will run again after the reset (TE0706-02: Reset push button S2; TE0703-05: Reset push button S1).

The extended, debug evaluation package [8] includes these binary applications:

- edkdsppp.elf: EdkDSP C pre-processor binary for ARM PetaLinux running on the evaluation board.
- edkdspcc.elf: EdkDSP C compiler binary for ARM PetaLinux running on the evaluation board.
- edkdsppsm.elf: EdkDSP ASM compiler binary for ARM PetaLinux running on the evaluation board.
- edkdspasm.elf: EdkDSP ASM compiler binary for ARM PetaLinux running on the evaluation board.

These binary applications have no time restriction. The user of the evaluation package has nonexclusive, non-transferable license from UTIA to use these utilities for compilation of the firmware for the Xilinx PicoBlaze6 processor inside of the UTIA EdkDSP accelerators in precompiled designs. The source code of these compilers is owned by UTIA and it is not provided in the evaluation package.

The extended, debug evaluation package for PRODUCTIVE 4.0 partners includes demonstration firmware in C source code for the Xilinx PicoBlaze6 processor for the family of UTIA EdkDSP accelerators for the Trenz Electronic TE0720-03-2IF, TE0720-03-1QF and TE0720-03-14S-1C module [1] on Trenz Electronic TE0706-02 or TE0703-05 carrier board [3].

The extended, debug evaluation package for PRODUCTIVE 4.0 partners includes SDK SW projects with C source code for MicroBlaze. The extended, debug evaluation package [8] includes static library for MicroBlaze processor:

- libwal.a: SDK 2017.4.1 UTIA static library with EdkDSP API for MicroBlaze

This library has no time restriction. Source code of this library is not provided in this evaluation package.

HW boards are not part of deliverables. HW can be ordered separately from references [1] – [5].

Partners of the ECSEL PRODUCTIVE 4.0 project [8] can order the hardware [1] - [5] directly from the company Trenz Electronic or order the complete evaluation system from UTIA AV CR, v.v.i.

In case of an order from UTIA AV CR, v.v.i., an email request for a quotation to kadlec@utia.cas.cz is required. UTIA AV CR, v.v.i., will send the PRODUCTIVE 4.0 project partner a quotation by email. After the PRODUCTIVE 4.0 project partner confirms the quotation, UTIA AV CR, v.v.i., will buy the boards [1]-[5] with cables and power supply from the company Trenz Electronic. UTIA will assemble and test the complete evaluation system and send it to the PRODUCTIVE 4.0 project partner for a price identical to the price offered by the company Trenz Electronic, plus the transport cost and the VAT.

Any and all legal disputes that may arise from or in connection with the use, intended use of or license for the software provided hereunder shall be exclusively resolved under the regional jurisdiction relevant for UTIA AV CR, v. v. i. and shall be governed by the law of the Czech Republic. See also the Disclaimer section.





## **Disclaimer**

This disclaimer is not a license and does not grant any rights to the materials distributed herewith. Except as otherwise provided in a valid license issued to you by UTIA AV CR v.v.i., and to the maximum extent permitted by applicable law:

- (1) THIS APPLICATION NOTE AND RELATED MATERIALS LISTED IN THIS PACKAGE CONTENT ARE MADE AVAILABLE "AS IS" AND WITH ALL FAULTS, AND UTIA AV CR V.V.I. HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY PARTICULAR PURPOSE; and
- (2) UTIA AV CR v.v.i. shall not be liable (whether in contract or tort, including negligence, or under any other theory of liability) for any loss or damage of any kind or nature related to, arising under or in connection with these materials, including for any direct, or any indirect, special, incidental, or consequential loss or damage (including loss of data, profits, goodwill, or any type of loss or damage suffered as a result of any action brought by a third party) even if such damage or loss was reasonably foreseeable or UTIA AV CR v.v.i. had been advised of the possibility of the same.

#### **Critical Applications:**

UTIA AV CR v.v.i. products are not designed or intended to be fail-safe, or for use in any application requiring fail-safe performance, such as life-support or safety devices or systems, Class III medical devices, nuclear facilities, applications related to the deployment of airbags, or any other applications that could lead to death, personal injury, or severe property or environmental damage (individually and collectively, "Critical Applications"). Customer assumes the sole risk and liability of any use of UTIA AV CR v.v.i. products in Critical Applications, subject only to applicable laws and regulations governing limitations on product liability.

