

## **Application Note**



# Full HD Video Processing in HW with three EdkDSP 8xSIMD Accelerators for TE0715-30-1 SoM on EMC2-DP-V2 Carrier

Jiří Kadlec, Zdeněk Pohl, Lukáš Kohout kadlec@utia.cas.cz, xpohl@utia.cas.cz, kohoutl@utia.cas.cz phone: +420 2 6605 2216 UTIA AV CR, v.v.i.

#### Revision history:

| Rev. | Date       | Author      | Description                                                                                                            |
|------|------------|-------------|------------------------------------------------------------------------------------------------------------------------|
| 1    | 20.02.2017 | Jiří Kadlec | Evaluation package for Xilinx SDK 2015.4 with max clock and 3x (8xSIMD) EdkDSP. Pre-release for the EMC2 WP4 partners. |
|      |            |             |                                                                                                                        |
|      |            |             |                                                                                                                        |

#### Acknowledgements:

This work has been partially supported by the ARTEMIS JU project EMC2 "Embedded Multi-Core Systems for Mixed Criticality Applications in Dynamic and Changeable Real-Time Environments", project number ARTEMIS JU 621429 and 7H14005 (Ministry of Education Youth and Sports of the Czech Republic). See http://www.emc2-project.eu/.

## **Table of contents**

- Full HD Video Processing in HW with three EdkDSP 8xSIMD Accelerators for TE0715-30-1 SoM on EMC2-DP-V2 Carrier
- 1\. Summary
  - 1.1 Full HD Platform with HDMI I/O and three EdkDSP Accelerators
    - Architecture
    - General setup of all demos
  - 1.2 Demos
    - Main objectives of demos
    - Edge detection
  - 1.3 Measurements of acceleration and resources used in included projects
  - 1.4 Project sh01: Edge detection with single HW accelerator and 3x EdkDSP
  - 1.5 Project sh02: Edge detection with two HW accelerators and 3x EdkDSP
  - 1.6 Project sh03: Edge detection with three HW accelerators and 3x EdkDSP
  - 1.7 Project md01: Motion detection with chain of HW accelerators and 3x EdkDSP
  - 1.8 Floating point performance
  - 1.9 Summary
  - 1.10 HW setup
- 2\. Installation of the evaluation package
  - 2.1 Import of SW projects in Xilinx SDK 2015.4
  - 2.2 Test demos
  - 2.3 Synchronisation of ARM C/C++ code with video processing HW accelerators
    - User defined synchronisation with parallel HW data paths (barrier)
    - Internal synchronisation with parallel HW data paths
  - 2.5 EdkDSP C compiler
  - 2.6 Debug of EdkDSP accelerator firmware with In-circuit Logic Analyser (ILA)
    - Debug ports of the (8xSIMD) EdkDSP accelerator
  - 2.7 Use of In-circuit Logic Analyser (ILA)
- 3\. Conclusions
- 4\. References
- 5\. The evaluation version of the package can be downloaded from UTIA www pages [14] free of charge
- 6\. Vivado projects with the evaluation version of the (8xSIMD) EdkDSP IP for the Artemis EMC2 project partners
- 7\. Vivado projects with the release version of the (8xSIMD) EdkDSP IP
- Disclaimer




## **Table of figures**

- Figure 1: EMC2-DP-V2 carrier, TE0715-03-30-1I module. Edge detection demo with 3 HW data paths and LMS filter in one of the three (8xSIMD) EdkDSP accelerators and the ILA debugger
- Figure 2: Full HD HDMII-HDMIO platform with three (8xSIMD) EdkDSP accelerators
- Figure 3: Project sh01 - Acceleration and HW resources used
- Figure 4: Power consumption of sh01 demo without ILA
- Figure 5: Power consumption of sh01 demo with ILA
- Figure 6: Project sh02 - Acceleration and HW resources used
- Figure 7: Power consumption of sh02 demo without ILA
- Figure 8: Power consumption of sh02 demo with ILA
- Figure 9: Project sh03 - Acceleration and HW resources used
- Figure 10: Power consumption of sh03 demo without ILA
- Figure 11: Power consumption of sh03 demo with ILA
- Figure 12: Project md01 - Acceleration and HW resources used
- Figure 13: Power consumption of md01 demo without ILA
- Figure 14: Power consumption of md01 demo with ILA
- Figure 15: EMC2-DP-V2 Set (1.8V) before applying power to the board
- Figure 16: RS1 header for RS232 cable connection to the EMC2-DP-V2
- Figure 17: Location and setting of JP7 and JP8 (1.8V) on EMC2-DP-V2 carrier board
- Figure 18: RS232 3-wire cable connected to the RS1 header of the EMC2-DP-V2 carrier board
- Figure 19: Select the SDK Workspace
- Figure 20: Import Existing Projects into Workspace
- Figure 21: Select "Copy projects into workspace" and finish the import of all projects
- Figure 22: All projects are compiled in debug mode
- Figure 23: EMC2-DP-V2 FMC side - USB (ARM terminal), RS232 (MicroBlaze terminal), HDMI I/O
- Figure 24: Download bitstream to the PL part of Zynq
- Figure 25: Select demo application for debug
- Figure 26: Demo is booted to the ARM and the debugger is waiting on the first executable line
- Figure 27: ARM is waiting on HW Mutex for the MicroBlaze start
- Figure 28: Select the MicroBlaze application (with the EdkDSP accelerator code) for debug
- Figure 29: MicroBlaze application is loaded and debugger stops on the first instruction
- Figure 30: ARM is running. It indicates the number of frames per second
- Figure 31: MicroBlaze is running. Debug version. It indicates MFLOPs
- Figure 32: MicroBlaze is running. Release version. It indicates MFLOPs
- Figure 33: HW accelerated motion detection video processing (md01) performed in parallel with the EdkDSP accelerators
- Figure 34: Listing of ARM C function using the internal synchronisation with parallel HW data paths
- Figure 35: Listing of ARM C function with fixed data width interfacing HW pipeline of accelerators
- Figure 36: Select the Ubuntu_EdkDSP image in the VMware Player and click "Play"
- Figure 37: Compilation of EdkDSP firmware in Ubuntu
- Figure 38: C listing of the LMS filter firmware for the EdkDSP
- Figure 39: C listing of the FIR filter firmware for the EdkDSP
- Figure 40: Change the default hw_platform_0 to the hw_platform_0_ila
- Figure 41: Debug ports of the (8xSIMD) EdkDSP floating point accelerator IP core
- Figure 42: Vivado Lab Edition 2015.4
- Figure 43: Select Open Target
- Figure 44: Select file with definition of probes present in HW
- Figure 45: FIR filter waveforms after the trigger in Vivado Lab Edition 2015.4
- Figure 46: LMS filter waveforms after the trigger in Vivado Lab Edition 2015.4
- Figure 47: Separate dashboard with display of temperature and voltage in Vivado Lab Edition 2015.4
- Figure 48: Accelerated video processing algorithm and ILA debug of the accelerated LMS filter

## **Table of Tables**

- Table 1: EMC2-DP V2 without modification with Artix TE0712 SoM modules
- Table 2: Problem in EMC2-DP V2 without modification with Zynq TE0715-15 or TE0715-30 module
- Table 3: EMC2-DP V2 with modification with Zynq TE0715-15 or TE0715-30 module
- Table 4: API for MicroBlaze C code
- Table 5: EdkDSP accelerator I/O API functions used by the PicoBlaze6 controller firmware
- Table 6: (8xSIMD) EdkDSP bce_fp12_1x8_40 accelerator vector operations



## 1. Summary

This application note describes demos of HW accelerated Full HD HDMI video processing and HW accelerated floating point filters computed in (8xSIMD) EdkDSP accelerators. The demos run on the largest Zynq platform with the Kintex PL fabric that is supported by the free Xilinx Vivado 2015.4 and SDK 2015.4 tool chain:

- 3 edge detection video processing designs (sh01, sh02, sh03)
  - These demos show that different HW paths can be defined by different source C/C++ functions. This is important for handling the border lines between the parts of the frame that are processed in parallel.
  - HW accelerators can be programmed for the number of processed micro-lines.
  - These demos enable efficient, synchronised parallel execution of accelerated data paths and ARM Cortex A9 standalone C code.
- 1 motion detection video processing design (md01)
  - This demonstrates the pipelined parallel execution of HW video processing accelerators.
  - HW accelerators work with a fixed number of processed micro-lines (1080 micro-lines) in this case.

Each Full HD demo also includes the HW accelerated computation of two DSP filters. These single precision floating point filters are computed on one of the three (8xSIMD) EdkDSP run-time reprogrammable floating point accelerators, which have these properties:

- C programs can be compiled for the single MicroBlaze processor and for one of the three EdkDSP accelerators. Compiled C (and ASM) code can be executed by the accelerators without the need to recompile the design in Vivado 2015.4 [11].
- C programs for the MicroBlaze processor and for the three (8xSIMD) EdkDSP accelerators can be edited in the same SDK 2015.4 environment that is used for ARM Cortex A9 programming and debug.
- The three EdkDSP accelerators can run different programs in parallel and can change tasks at run time (task migration).
- The design supports run-time re-programming of each of the (8xSIMD) EdkDSP accelerators under the control of the user-defined MicroBlaze C program.
- The MicroBlaze processor executes its program and uses data located in the top 256 MB of the 1 GB DDR3 memory. This region is also accessible by the ARM processor. The ARM initiates and controls the content of the program and data executed by the MicroBlaze.
- ARM and MicroBlaze programs use a HW mutex for synchronization.
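The mutex handshake between the ARM and the MicroBlaze can be pictured with a small software model. This is a minimal sketch, assuming an owner-tagged lock word; the type `hw_mutex_t`, the CPU IDs and both function names are hypothetical and are not the API of the mutex IP used in the design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical owner IDs; a real design assigns them in the HW configuration. */
enum { CPU_ARM = 1, CPU_MICROBLAZE = 2 };

/* Software model of one lock word: 0 means free, otherwise the owner ID. */
typedef struct {
    volatile uint32_t owner;
} hw_mutex_t;

/* Try to take the lock. Succeeds only if the lock is currently free.
 * In real HW the test and the set happen as one atomic bus access. */
bool hw_mutex_trylock(hw_mutex_t *m, uint32_t cpu_id)
{
    assert(cpu_id != 0);
    if (m->owner != 0)
        return false;
    m->owner = cpu_id;
    return true;
}

/* Release the lock. Only the current owner is allowed to release it. */
bool hw_mutex_unlock(hw_mutex_t *m, uint32_t cpu_id)
{
    if (m->owner != cpu_id)
        return false;
    m->owner = 0;
    return true;
}
```

In this model the ARM would hold the lock while it prepares the MicroBlaze program and data in the shared DDR3 region, and the MicroBlaze would poll `hw_mutex_trylock` until the ARM releases it.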

#### 1.1 Full HD Platform with HDMI I/O and three EdkDSP Accelerators

This application note describes a HW platform that integrates three runtime reprogrammable (8xSIMD) EdkDSP floating point accelerators. These accelerators work in parallel with an edge detection or motion detection video processing algorithm. The source of video data is an HDMI input with resolution 1920x1080p60 (Full HD). The platform is composed of these HW building blocks (boards):

- Zynq XC7Z030-1I device on System on Module TE0715-03-30-1I from Trenz-electronic [1], ([2], [3], [4]).
- Carrier board EMC2-DP-V2 with FMC connector from Sundance [6], [7].
- AES-FMC-HDMI-CAM-G FMC HDMI I/O extension board from Avnet [8].
- RS232 serial interface.

All implemented Full HD video processing algorithms have been developed, debugged and tested in the Xilinx SDSoC 2015.4 environment [12]. The SW algorithms have been compiled by the Xilinx SDSoC 2015.4 system level compiler (based on the Xilinx HLS compiler) to Vivado 2015.4 HW projects, and compiled by Xilinx Vivado 2015.4 [11] to bitstreams for the Zynq XC7Z030-1I device. The generated SW access functions controlling the HW accelerators have been exported from the Xilinx SDSoC 2015.4 environment to the Xilinx SDK 2015.4 [11] SW projects as static libraries for the standalone ARM Cortex A9 processor C programs.

signal processing http://zs.utia.cas.cz





Figure 1: EMC2-DP-V2 carrier, TE0715-03-30-1I module. Edge detection demo with 3 HW data paths and LMS filter in one of the three (8xSIMD) EdkDSP accelerators and the ILA debugger.

#### **Architecture**

The Xilinx Zynq XC7Z030-1I device has two ARM Cortex A9 processors (operating at 666 MHz). The memory controller of the Zynq device provides DDR3 memory access ports to the ARM processors as well as to the programmable logic (PL). The PL is used for:

1. Three UTIA EdkDSP (8xSIMD) floating point processors (operating at 150 MHz) connected to a Xilinx MicroBlaze 32-bit processor (operating at 125 MHz).
2. An input chain that moves Full HD video data to the input video frame buffers. The input video DMA (VDMA) controller operates at 150 MHz.
3. Video processing HW accelerators and data movers defined in the Xilinx SDSoC 2015.4 environment. These accelerators are controlled from the ARM Cortex A9 C programs compiled in SDK 2015.4 C projects. They operate at 200 MHz.
4. A chain of output video processing IPs that connects the output frame buffers to the Full HD display by an HDMI cable. The output VDMA controller operates at 150 MHz.

- The three EdkDSP 8xSIMD floating point accelerators are reprogrammable at runtime by changing the firmware of the built-in PicoBlaze6 8-bit controllers. These controllers serve as schedulers of the vector operations performed in the EdkDSP 8xSIMD floating point data paths. The schedulers are programmed by simple C programs compiled by the UTIA C compiler and assembler. These tools respect the minimal resources of the PicoBlaze6 controllers.


- The three EdkDSP 8xSIMD floating point accelerators are controlled by a single 32-bit MicroBlaze processor. The MicroBlaze processor executes larger C programs from the DDR3 memory. Algorithms can benefit from executing selected operations on the three EdkDSP coprocessors. The EdkDSP coprocessors are connected to the MicroBlaze by local dual-ported memories.
- A MicroBlaze C program can overlap data communication from DDR3 to the EdkDSP dual-ported memories (managed by the MicroBlaze processor) with the parallel computations performed in the three EdkDSP accelerators, which are controlled locally by the three PicoBlaze6 sequencers.
- All designs also include the video processing chain of Full HD I/O IPs controlled by the ARM processor via the AXI-Lite control bus operating at 125 MHz.
- The ARM Cortex A9 processor performs the global initialization and synchronisation of the video processing chain. The ARM program and the FPGA image are downloaded to the board from the Xilinx SDK 2015.4 via USB JTAG to the 1 GB DDR3 located on the Zynq system on module.
- The system can also be started directly from the SD card with the help of the ARM FSBL loader. The ARM processor initializes the program for the MicroBlaze processor. This MicroBlaze program also contains the initial firmware for the three EdkDSP accelerators. The ARM processor also initiates the HDMI video input and video output IPs.
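The overlap of DDR3 transfers with accelerator computation mentioned above is commonly organised as ping-pong (double) buffering. The following is a minimal sequential sketch of that schedule, not the EdkDSP API: `BLOCK`, `accel_sum` and `pingpong_sum` are hypothetical names, and a plain sum stands in for the vector operation executed by an accelerator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 4 /* hypothetical block length in floats */

/* Stand-in for the work an accelerator performs on one local buffer. */
static float accel_sum(const float *buf, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += buf[i];
    return s;
}

/* Process nblocks blocks of src using two local buffers.
 * While buffer `cur` is being processed, buffer `nxt` is refilled. */
float pingpong_sum(const float *src, size_t nblocks)
{
    float local[2][BLOCK];
    float total = 0.0f;

    assert(src != NULL);
    if (nblocks == 0)
        return total;

    /* Prologue: fill the first buffer before any computation starts. */
    memcpy(local[0], src, sizeof local[0]);

    for (size_t k = 0; k < nblocks; k++) {
        size_t cur = k & 1u;
        size_t nxt = cur ^ 1u;

        /* On the real platform the next two steps would run concurrently:
         * the processor copies block k+1 while the accelerator works on block k. */
        if (k + 1 < nblocks)
            memcpy(local[nxt], src + (k + 1) * BLOCK, sizeof local[nxt]);
        total += accel_sum(local[cur], BLOCK);
    }
    return total;
}
```

The two buffers play the role of the dual-ported memories: the copy never touches the buffer that is currently being processed.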



Figure 2: Full HD HDMII-HDMIO platform with three (8xSIMD) EdkDSP accelerators.



#### Configurations of video processing Accelerators and EdkDSP Accelerators:

Next sections describe used resources and the acceleration for these configurations:

- The MicroBlaze and the three (8xSIMD) EdkDSP accelerators are present in HW. The MicroBlaze computes a floating point FIR filter on one of the three EdkDSP accelerators and the video accelerator chain computes the selected video processing algorithm in HW.
- The MicroBlaze and the three (8xSIMD) EdkDSP accelerators are present in HW. The MicroBlaze computes a floating point LMS adaptive filter on one of the three EdkDSP accelerators and the video accelerator chain computes the selected video processing algorithm in HW.
- The MicroBlaze and the three (8xSIMD) EdkDSP accelerators are present in HW. The MicroBlaze computes the FIR or LMS filter in SW (only with its internal HW floating point unit) in parallel to the dedicated video processing accelerator HW chain. None of the three (8xSIMD) EdkDSP accelerators is used.
- The MicroBlaze is present in HW. It computes the FIR or LMS filter in SW (only with its internal HW floating point unit) in parallel to the dedicated video processing accelerator HW chain. The (8xSIMD) EdkDSP accelerators are not present in the PL logic.

The video processing algorithms were first implemented in SW on the ARM Cortex A9 processor. The ARM C/C++ code was compiled with -O3 optimisation (but without use of the NEON accelerator) in the SDSoC 2015.4 environment [12]. The related HW resources include the MicroBlaze with the three (8xSIMD) EdkDSP accelerators present in the PL part of the Zynq device and only the basic HDMI input and output HW support.

The evaluation designs with HW accelerators have been created from the selected C/C++ functions in the SDSoC 2015.4 environment. New HW designs have been generated and exported into the final set of SDK 2015.4 projects. The resulting demos are included in the evaluation package.

#### General setup of all demos:

- The ARM Cortex A9 processor of the Xilinx Zynq XC7Z030-1I device executes standalone C application programs that perform initialisation and synchronisation of the HW accelerated video processing chains.
- The enclosed C programs for the ARM, the MicroBlaze and the PicoBlaze6 sequencers can be modified by the user and recompiled in the single Xilinx SDK 2015.4 development framework.
- The video signal input has resolution 1920x1080p60.
- Data are converted in HW into the YCrCb 4:2:2 (16 bit per pixel) format and stored by a video DMA (VDMA) controller to input video frame buffers (VFBs) reserved in the DDR3.
- HW DMA controller(s) send data from the input VFBs to the processing HW accelerators in the programmable logic (PL) part of the Zynq.
- Other HW DMA controller(s) send processed data from HW to output VFBs in the DDR3.
- The second part of the HW VDMA sends data from the output VFBs to the Full HD display via HDMI.
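The frame buffer size and the data rate of one video stream follow directly from the resolution, frame rate and pixel format listed above. A minimal sketch of the arithmetic is given here; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes in one frame of a packed pixel format. */
uint64_t frame_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    assert(bits_per_pixel % 8 == 0);
    return (uint64_t)width * height * (bits_per_pixel / 8);
}

/* Bytes per second moved by one video stream at the given frame rate. */
uint64_t stream_bytes_per_second(uint32_t width, uint32_t height,
                                 uint32_t bits_per_pixel, uint32_t fps)
{
    return frame_bytes(width, height, bits_per_pixel) * fps;
}
```

For 1920x1080 at 16 bit per pixel, `frame_bytes` gives 4,147,200 bytes per frame, and at 60 frames per second `stream_bytes_per_second` gives 248,832,000 bytes per second for each stream written to or read from the DDR3.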

#### 1.2 Demos

#### Main objectives of demos:

- To demonstrate how to install, compile, modify and use the enclosed SW projects in the SDK 2015.4.
- To demonstrate the HW accelerated video processing algorithms and the acceleration in comparison to the original ARM Cortex A9 SW versions of video processing algorithms.
- To demonstrate parallel execution of predefined video processing HW paths with C user code on ARM.
- To demonstrate HW accelerated video processing algorithms and the accelerated floating point FIR/LMS filters computed by the 8xSIMD EdkDSP run-time re-programmable floating point accelerator.
- To evaluate power consumption of several system configurations.





#### **Edge detection**

The edge detection algorithm detects edges in each frame. Edges are marked white and the remaining part of the frame is set to black.

The edges are detected by a Sobel filter. Each pixel is filtered by a 3x3 2D FIR filter. A nonlinear decision on the output of the filter determines whether the pixel is part of an edge or not. All computation is performed in fixed point. The input to the Sobel filter is the video signal with each pixel converted to the monochrome 8-bit format.
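The Sobel step described above can be sketched in plain integer C. This is an illustrative SW version, not the source of the HW accelerated functions: the function name, the `threshold` parameter, the |gx| + |gy| magnitude approximation and the handling of image borders are assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sobel edge detector on an 8-bit monochrome image stored row by row.
 * Output is 255 (white) for edge pixels and 0 (black) otherwise.
 * Pixels on the image border are set to 0 in this sketch. */
void sobel_edges(const uint8_t *in, uint8_t *out,
                 int width, int height, int threshold)
{
    assert(width >= 3 && height >= 3);

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            if (y == 0 || y == height - 1 || x == 0 || x == width - 1) {
                out[y * width + x] = 0;
                continue;
            }
            const uint8_t *p = in + y * width + x;

            /* Horizontal and vertical 3x3 gradients in integer arithmetic. */
            int gx = -p[-width - 1] + p[-width + 1]
                     - 2 * p[-1]    + 2 * p[1]
                     - p[width - 1] + p[width + 1];
            int gy = -p[-width - 1] - 2 * p[-width] - p[-width + 1]
                     + p[width - 1] + 2 * p[width]  + p[width + 1];

            /* Nonlinear decision: compare the gradient magnitude to a threshold. */
            int mag = abs(gx) + abs(gy);
            out[y * width + x] = (mag > threshold) ? 255 : 0;
        }
    }
}
```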

Demos **sh01**, **sh02** and **sh03** provide accelerated HW computation of edge detection with 1, 2 or 3 parallel HW data paths. Computation of horizontal border line is resolved in case of sh02 and sh03. All these demos support synchronised parallel execution of user defined C code on ARM while the HW data paths perform accelerated video processing.

HW demos are using 1, 2 or 3 DMA HW channels as input from DDR3 to 1, 2 or 3 Sobel filters. Another 1, 2 or 3 DMA HW channels support output from Sobel filters to the DDR3. Demos are linked with static libraries libsh01.a, libsh02.a or libsh03.a.
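When a frame is split between parallel data paths, a 3x3 filter needs one extra input line on each side of an internal tile border. The sketch below shows one possible way to compute such horizontal tiles; it is an assumption for illustration only, and `tile_t` and `make_tile` are hypothetical names that do not describe how the sobel\_filter\_htile functions are actually written.

```c
#include <assert.h>

typedef struct {
    int out_first; /* first output line produced by this data path */
    int out_count; /* number of output lines produced */
    int in_first;  /* first input line read, including the border line */
    int in_count;  /* number of input lines read, including border lines */
} tile_t;

/* Split `height` lines into `ntiles` horizontal tiles for a 3x3 filter. */
tile_t make_tile(int height, int ntiles, int index)
{
    assert(ntiles > 0 && index >= 0 && index < ntiles && height >= ntiles);

    int base = height / ntiles;
    int rem = height % ntiles;
    tile_t t;

    /* Distribute the remainder lines over the first `rem` tiles. */
    t.out_first = index * base + (index < rem ? index : rem);
    t.out_count = base + (index < rem ? 1 : 0);

    /* Extend the input range by one line across each internal border. */
    int in_last = t.out_first + t.out_count - 1;
    t.in_first = (t.out_first > 0) ? t.out_first - 1 : 0;
    if (in_last < height - 1)
        in_last++;
    t.in_count = in_last - t.in_first + 1;
    return t;
}
```

With 1080 lines and three tiles, the middle tile produces lines 360 to 719 and reads lines 359 to 720, so every output line next to a tile border sees both of its neighbouring lines.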

#### Motion detection

The motion detection algorithm detects and performs visualisation of moving edges. The moving edges are identified by two Sobel filters performing FIR filtering (similar to the above described edge detection) on pixels with identical coordinates but from two subsequent video frames. A difference of these filtered results is computed. This difference signal is finally filtered by the median filter.

Resulting signal is used for the nonlinear binary decision if the analysed pixel is part of a moving edge or not. If the pixel is part of a moving edge, it is assigned red colour and merged with the original colour video signal. Resulting output video signal is unchanged, with the exception of red colour marked moving edges.
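The per-pixel steps after the two Sobel filters can be sketched as follows. This is a minimal illustration under stated assumptions: the difference is taken as an absolute difference, the median filter uses a 3x3 window, and the names `diff_pixel`, `median9`, `is_moving_edge` and the `threshold` parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Absolute difference of two Sobel responses at the same pixel coordinates
 * in two subsequent frames. */
uint8_t diff_pixel(uint8_t sobel_prev, uint8_t sobel_curr)
{
    return (uint8_t)abs((int)sobel_curr - (int)sobel_prev);
}

/* Median of 9 values, computed by insertion sort on a local copy. */
uint8_t median9(const uint8_t v[9])
{
    uint8_t s[9];
    for (int i = 0; i < 9; i++) {
        int j = i;
        while (j > 0 && s[j - 1] > v[i]) {
            s[j] = s[j - 1];
            j--;
        }
        s[j] = v[i];
    }
    return s[4];
}

/* Binary decision: nonzero if the median-filtered difference marks a moving edge. */
int is_moving_edge(const uint8_t diff_window[9], uint8_t threshold)
{
    assert(diff_window != NULL);
    return median9(diff_window) > threshold;
}
```

The median filter suppresses isolated differences: a window with a single large value still has a median of zero, so single-pixel noise is not marked as a moving edge.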

Demo **md01** provides accelerated HW computation with one parallel HW data path. The HW demo uses 2 DMA HW channels to read from two subsequent video frame buffers, both located in the DDR3, into the video processing chain of accelerators performing the motion detection. Another DMA HW channel performs a parallel write of results to the DDR3. The demo is linked with the static library libmd01.a.

#### 1.3 Measurements of acceleration and resources used in included projects

The acceleration has been measured as the ratio of the frames per second (FPS) reached by the accelerator to the FPS reached by the initial SW implementation on ARM in SDSoC 2015.4.

The SW implementation was compiled with -O3 optimisation. HW support for the HDMI I/O data movement by the dedicated VDMA HW channels was used in all cases.



#### 1.4 Project sh01: Edge detection with single HW accelerator and 3x EdkDSP

The accelerated data path runs at 200 MHz and includes HW version of the function:

sobel\_filter\_htile1

with one data-mover controlling one input and one output AXI DMA channel to the DDR3.

**Acceleration by HW: 6.77x**

Figure 3: Project sh01 - Acceleration and HW resources used. Panel title: TE0715-03-30-1I (with ILA) Sobel 1x.

#### **Power consumption**

Power consumption of the complete working system has been measured for these configurations (see Figure 4 and Figure 5):

- ARM A9 + video I/O and HW accelerators + MicroBlaze
- ARM A9 + video I/O and HW accelerators + MicroBlaze + 3x EdkDSP HW instantiated
- ARM A9 + video I/O and HW accelerators + MicroBlaze + 3x EdkDSP HW instantiated, with one of the EdkDSP accelerators computing the LMS or FIR filter in floating point:
  - One EdkDSP HW accelerator computing LMS filter: 429 mW/GFLOP/s
  - One EdkDSP HW accelerator computing FIR filter: 293 mW/GFLOP/s
  - One EdkDSP HW accelerator computing LMS filter with ILA: 442 mW/GFLOP/s
  - One EdkDSP HW accelerator computing FIR filter with ILA: 302 mW/GFLOP/s





Figure 4: Power consumption of sh01 demo without ILA



Figure 5: Power consumption of sh01 demo with ILA



#### 1.5 Project sh02: Edge detection with two HW accelerators and 3x EdkDSP

The two parallel accelerated data paths run at 200 MHz and include HW version of these functions:

- sobel\_filter\_htile1
- sobel\_filter\_htile2

with two data-movers controlling two input and two output AXI DMA channels to the DDR3.

**Acceleration by HW: 8.33x**

Figure 6: Project sh02 - Acceleration and HW resources used. Panel titles: TE0715-03-30-1I (no ILA) Sobel 2x; TE0715-03-30-1I (with ILA) Sobel 2x.

#### Power consumption:

Power consumption of the complete working system has been measured for these configurations (see Figure 7 and Figure 8):

- ARM A9 + video I/O and HW accelerators + MicroBlaze
- ARM A9 + video I/O and HW accelerators + MicroBlaze + 3x EdkDSP HW instantiated
- ARM A9 + video I/O and HW accelerators + MicroBlaze + 3x EdkDSP HW instantiated, with one of the EdkDSP accelerators computing the LMS or FIR filter in floating point:
  - One EdkDSP HW accelerator computing LMS filter: 420 mW/GFLOP/s
  - One EdkDSP HW accelerator computing FIR filter: 288 mW/GFLOP/s
  - One EdkDSP HW accelerator computing LMS filter with ILA: 438 mW/GFLOP/s
  - One EdkDSP HW accelerator computing FIR filter with ILA: 299 mW/GFLOP/s





Figure 7: Power consumption of sh02 demo without ILA



Figure 8: Power consumption of sh02 demo with ILA




#### 1.6 Project sh03: Edge detection with three HW accelerators and 3x EdkDSP

The three parallel accelerated data paths run at 200 MHz and include HW version of these functions:

- sobel\_filter\_htile1
- sobel\_filter\_htile2
- sobel\_filter\_htile3

with three data-movers controlling three input and three output AXI DMA channels to the DDR3.

**Acceleration by HW: 8.38x**

Figure 9: Project sh03 - Acceleration and HW resources used. Panel titles: TE0715-03-30-1I (no ILA) Sobel 3x; TE0715-03-30-1I (with ILA) Sobel 3x.

#### **Power consumption**

Power consumption of the complete working system has been measured for these configurations (see Figure 10 and Figure 11):

- ARM A9 + video I/O and HW accelerators + MicroBlaze
- ARM A9 + video I/O and HW accelerators + MicroBlaze + 3x EdkDSP HW instantiated
- ARM A9 + video I/O and HW accelerators + MicroBlaze + 3x EdkDSP HW instantiated, with one of the EdkDSP accelerators computing the LMS or FIR filter in floating point:
  - One EdkDSP HW accelerator computing LMS filter: 420 mW/GFLOP/s
  - One EdkDSP HW accelerator computing FIR filter: 288 mW/GFLOP/s
  - One EdkDSP HW accelerator computing LMS filter with ILA: 433 mW/GFLOP/s
  - One EdkDSP HW accelerator computing FIR filter with ILA: 296 mW/GFLOP/s





Figure 10: Power consumption of sh03 demo without ILA



Figure 11: Power consumption of sh03 demo with ILA




#### 1.7 Project md01: Motion detection with chain of HW accelerators and 3x EdkDSP

Motion detection is implemented as a single accelerated data path running at 200 MHz. It processes data from two subsequent video frames and includes a chain of HW versions of these functions:

1. pad (two instances)
2. sobel\_filter\_pass, sobel\_filter
3. diff\_image
4. median\_char\_filter\_pass
5. combo\_image
6. ext

with four data-movers controlling the 200 MHz input AXI DMA channels from the DDR3 and one data-mover controlling the 200 MHz output AXI DMA channel to the DDR3.

**Acceleration by HW: 35.45x**

Figure 12: Project md01 - Acceleration and HW resources used. Panel titles: TE0715-03-30-1I (no ILA) Motion Det. 1x; TE0715-03-30-1I (with ILA) Motion Det. 1x.

#### **Power consumption**

Power consumption of the complete working system has been measured for these configurations (see Figure 13 and Figure 14):

- ARM A9 + video I/O and HW accelerators + MicroBlaze
- ARM A9 + video I/O and HW accelerators + MicroBlaze + 3x EdkDSP HW instantiated
- ARM A9 + video I/O and HW accelerators + MicroBlaze + 3x EdkDSP HW instantiated, with one of the EdkDSP accelerators computing the LMS or FIR filter in floating point:
  - One EdkDSP HW accelerator computing LMS filter: 433 mW/GFLOP/s
  - One EdkDSP HW accelerator computing FIR filter: 296 mW/GFLOP/s
  - One EdkDSP HW accelerator computing LMS filter with ILA: 425 mW/GFLOP/s
  - One EdkDSP HW accelerator computing FIR filter with ILA: 290 mW/GFLOP/s





Figure 13: Power consumption of md01 demo without ILA



Figure 14: Power consumption of md01 demo with ILA



#### 1.8 Floating point performance

This chapter summarises the measured sustained single precision floating point performance of the system, achieved in parallel with the accelerated video processing:

- **1411 MFLOP/s** on the 150 MHz (8xSIMD) EdkDSP (FIR filter in floating point on a single 8xSIMD accelerator).
- **12 MFLOP/s** on the 125 MHz MicroBlaze processor (with the MB single precision floating point unit in HW).
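The two figures above can be put into proportion with a short calculation. The following is a minimal sketch, not part of the evaluation package; the function names are hypothetical. It only divides the quoted numbers, so the results are indicative: the terminal captures later in this note report the MicroBlaze figure for the LMS identification, while the EdkDSP figure is for the FIR filter.

```c
#include <assert.h>

/* Figures quoted in this note. */
#define EDKDSP_FIR_MFLOPS   1411.0   /* sustained, one 8xSIMD accelerator */
#define MICROBLAZE_MFLOPS     12.0   /* MicroBlaze with HW FP unit        */
#define EDKDSP_CLOCK_MHZ     150.0   /* EdkDSP clock                      */

/* Floating point operations completed per clock cycle:
 * (MFLOP/s) divided by (Mcycles/s). */
double flop_per_cycle(double mflops, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return mflops / clock_mhz;
}

/* Ratio of two sustained throughput figures. */
double throughput_ratio(double fast_mflops, double slow_mflops)
{
    assert(slow_mflops > 0.0);
    return fast_mflops / slow_mflops;
}
```

With the quoted values, `flop_per_cycle(EDKDSP_FIR_MFLOPS, EDKDSP_CLOCK_MHZ)` gives about 9.4 operations per 150 MHz clock cycle, and `throughput_ratio(EDKDSP_FIR_MFLOPS, MICROBLAZE_MFLOPS)` gives a ratio of roughly 118.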

The controller inside each (8xSIMD) EdkDSP accelerator is reprogrammed by firmware compiled from C code with the UTIA EDKDSP C compiler. Each accelerator can be programmed with two firmware programs. Designs can swap the firmware at run time in only a few clock cycles. The alternative firmware can be downloaded to the (8xSIMD) EdkDSP accelerator controllers in parallel with the execution of the current firmware.

This is demonstrated by swapping the firmware for the FIR filter (room response) with the firmware for the adaptive LMS identification of the filter coefficients in the acoustic noise cancellation demo. It also demonstrates the mechanism and support for moving from one task to another task on the same accelerator.
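The two-firmware scheme described above can be pictured as a pair of program slots with one active index. The sketch below is a plain C model written for illustration only. The type `fw_bank_t`, the functions `fw_init`, `fw_load_inactive` and `fw_swap`, and the slot size `FW_MAX_WORDS` are all hypothetical and do not correspond to the EdkDSP API or to the real program memory size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FW_SLOTS      2      /* two firmware programs per accelerator (from the text) */
#define FW_MAX_WORDS  1024   /* hypothetical slot size, not the real memory size       */

typedef struct {
    unsigned int program[FW_SLOTS][FW_MAX_WORDS];
    size_t       length[FW_SLOTS];
    int          active;     /* index of the slot currently being executed */
} fw_bank_t;

/* Start with empty slots and slot 0 active. */
void fw_init(fw_bank_t *b)
{
    assert(b != NULL);
    memset(b, 0, sizeof *b);
}

/* Copy a new program into the slot that is NOT running, so the running
 * firmware is never disturbed. Returns the slot index, or -1 if too long. */
int fw_load_inactive(fw_bank_t *b, const unsigned int *words, size_t n)
{
    int slot;
    assert(b != NULL && words != NULL);
    if (n > FW_MAX_WORDS)
        return -1;
    slot = 1 - b->active;
    memcpy(b->program[slot], words, n * sizeof words[0]);
    b->length[slot] = n;
    return slot;
}

/* Swap: only the active index changes, which models why the switch is cheap. */
int fw_swap(fw_bank_t *b)
{
    assert(b != NULL);
    b->active = 1 - b->active;
    return b->active;
}
```

In this model the expensive step (copying the program) happens in the background slot, and the swap itself touches a single index. This mirrors the behaviour described in the text without claiming anything about how the hardware implements it.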

Each of the three (8xSIMD) EdkDSP accelerators can deliver single-precision floating point results, which are bit-exact identical to the reference software implementation running on the MicroBlaze with the Xilinx HW single precision floating point unit.

#### 1.9 Summary

The 28nm Kintex-based programmable logic part of the Zynq XC7Z030-1I device can accommodate **three** UTIA (8xSIMD) EdkDSP floating point accelerators together with the Full HD video processing chain for real-time video processing.

The combination of a single 32bit MicroBlaze with three instances of the (8xSIMD) EdkDSP single precision floating point accelerators adds the capability to compute single precision floating point operations with a performance of **1411 MFLOP/s** (in the case of the FIR filter) on a single (8xSIMD) EdkDSP accelerator, at the expense of a relatively moderate increase in the total power consumption of the system.

Instantiation of the 125 MHz MicroBlaze processor with three instances of the 150MHz (8xSIMD) EdkDSP accelerators makes it possible to work with triple redundancy and, in parallel, to execute HW accelerated video processing algorithms. The optional in-circuit logic analyser (ILA) is capable of triggering and visualizing up to 32k data samples at the 150MHz clock rate for the first of the three (8xSIMD) EdkDSP accelerators. This is very useful for debugging the sequences of vector operations and the addresses generated by the sequencer of the EdkDSP accelerator.
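The ILA capture depth translates into a time window that is useful to keep in mind when placing triggers. The helper below is a small sketch with a hypothetical name. It assumes that "32k" means 32768 samples and that one sample is stored per clock cycle, neither of which is stated explicitly in this note.

```c
#include <assert.h>

/* Length of one ILA capture window in microseconds.
 * samples:   number of stored samples (assumed one per clock cycle)
 * clock_mhz: sampling clock in MHz, i.e. cycles per microsecond */
double ila_window_us(unsigned long samples, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return (double)samples / clock_mhz;
}
```

Under these assumptions `ila_window_us(32768UL, 150.0)` evaluates to about 218 microseconds of accelerator activity per capture.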

Designs debugged and developed in the high level SDSoC 2015.4 environment [12] are exported for the end-user in the form of SDK 2015.4 [11] projects. The released evaluation package with SDK 2015.4 projects provides sufficient freedom for the end-user to make certain SW adaptations and customisations of the final application without the need to understand all low level details of the accelerator IP cores or of the Vivado 2015.4 project. The initial SDSoC 2015.4 board support package is not needed for the released precompiled SDK 2015.4 projects. The SDSoC 2015.4 license is also not needed to run and modify the SDK 2015.4 SW projects.





#### 1.10 HW setup

HW setup uses commercially accessible components [1]-[8].

TE0715-03-30-1I; Part: XC7Z030-1SBG485I; 1GByte DDR; Industrial Grade [1] (or modules [2], [3], [4]).

Heatsink for TE0715, spring-loaded embedded [5].

**EMC2-DP-V2 Sundance Carrier Board** [6], [7].

AES-FMC-HDMI-CAM-G FMC card with HDMI I/O and CAM interface [8].

RS232 serial interface serves as the RS232C ASCII terminal for the MicroBlaze.

The unmodified EMC2-DP-V2 Carrier Board is supporting the TE0712 family of SoM modules with the Artix device XC7A100 and XC7A200. See *Table 1*.

Table 1: EMC2-DP V2 without modification with Artix TE0712 SoM modules

| TE0712 FPGA pin | TE0712 signal name | JM1 module connector | EMC2-DP V2 without modification | JB1 board connector | EMC2-DP V2 signal name | FMC IMAGEON card [8] signal |
|-----------------|--------------------|----------------------|---------------------------------|---------------------|------------------------|-----------------------------|
| B17             | B16_L11_P          | 66                   | ← OK                            | 65 <- 65            | FMC_CLK1_P             | HDMII_CLK (in)              |
| B18             | B16_L11_N          | 68                   | → OK                            | 67 <- 67            | FMC_CLK1_N             | HDMIO_CLK (out)             |

The EMC2-DP-V2 Carrier Board requires one modification to run the demos on AES-FMC-HDMI-CAM-G with Zynq TE0715-03-30-1C or TE0715-03-30-1I system on module. The modification is related to the swapped polarity of the differential clock signals for the AES-FMC-HDMI-CAM-G FMC board on the Zynq SoM module [1] (or modules [2], [3], [4]). See *Table 2*.

Table 2: Problem in EMC2-DP V2 without modification with Zynq TE0715-15 or TE0715-30 module

| TE0715-30 / TE0715-15 FPGA pin | TE0715-30 / TE0715-15 signal name | JM1 module connector | EMC2-DP V2 without modification | JB1 board connector | EMC2-DP V2 signal name | FMC IMAGEON card [8] signal |
|--------------------------------|-----------------------------------|----------------------|---------------------------------|---------------------|------------------------|-----------------------------|
| AA17                           | B12_L14_N                         | 66                   | ←                               | 65 <- 65            | FMC_CLK1_P             | HDMII_CLK (in)              |
| AA16                           | B12_L14_P                         | 68                   | →                               | 67 <- 67            | FMC_CLK1_N             | HDMIO_CLK (out)             |

The AES-FMC-HDMI-CAM-G FMC board drives the 148.5 MHz clock signal coming from the external HDMI source as single-ended clock signal HDMII\_CLK (in). This single ended input clock has to come to the AA16 pin of the TE0715-30 Zynq FPGA. The single ended clock input with such clock speed cannot be safely received via the AA17 pin of the FPGA. This modification can be done on the EMC2-DP V2 PCB close to the board PCB side of the JB1 connector. See *Table 3*.

Table 3: EMC2-DP V2 with modification with Zynq TE0715-15 or TE0715-30 module

| TE0715-30 FPGA pin | TE0715-30 signal name | JB1 module connector | EMC2-DP V2 with modification | JB1 board fix | EMC2-DP V2 signal name | FMC IMAGEON card [8] signal |
|--------------------|-----------------------|----------------------|------------------------------|---------------|------------------------|-----------------------------|
| AA17               | B12_L14_N             | 66                   | → OK                         | 65 x 67       | FMC_CLK1_N             | HDMIO_CLK (out)             |
| AA16               | B12_L14_P             | 68                   | ← OK                         | 67 x 65       | FMC_CLK1_P             | HDMII_CLK (in)              |

UTIA can implement this necessary HW modification for the Sundance EMC2-DP-V2 carrier board [6]. This requires a written e-mail request to kadlec@utia.cas.cz. The request will first be confirmed by UTIA. The interested party has to cover the cost of shipping the original EMC2-DP-V2 carrier board. The modification can be done in 5 working days and is offered free of charge.





The EMC2-DP V2 with the modification described in Table 3 supports HDMI input and output in Full HD with the FMC IMAGEON card [8] with all TE0720 modules, all TE0715-15 modules, all TE0715-30 modules ([1], [2], [3], [4]) and the TE0715-04-30-3E module.

Please note that the modified EMC2-DP V2 card can no longer support the HDMI input in Full HD from the FMC IMAGEON card [8] with the TE0712 SoM modules.

This application note and the evaluation package support these HW options without any additional modification:

The **TE0715-03-30-1I** SoM [1] can be replaced by **TE0715-04-30-1I** [2] or **TE0715-03-30-1C** [3] or **TE0715-04-30-1C** [4]. Speed grades are identical. Identical bitstreams, HW and SW can be used. The **1I** modules have industrial temperature range of -40°C to +85°C. The **1C** modules have commercial temperature range of 0°C to +70°C. See the EMC2-DP-V2 technical reference manual [7] for the description of the EMC2-DP-V2 carrier board [6].

#### Set the EMC2-DP-V2 carrier board switches to 1.8V

The PL IO-bank supply voltages of the TE0715-03-30-1I Zynq device **must be set to 1.8V.** A higher voltage for the I/O banks is not possible for this device. It would damage the device! See the position of jumpers JP7 and JP8 in Figure 15 and Figure 17. It is highly recommended to set all switches of the EMC2-DP-V2 carrier board and to measure the PL IO-bank supply voltage before mounting the module on the EMC2-DP-V2. This avoids failures of, and damage to, the mounted module that would result from a voltage higher than 1.8V on the HPF banks.

|   | 0 |   | JP8A                                                      |
|---|---|---|-----------------------------------------------------------|
| 1 | 2 | 3 | JP8 - connect positions 2-3 of the jumper to get the 1.8V |
| 1 | 2 | 3 | JP7 - connect positions 2-3 of the jumper to get the 1.8V |
|   | 0 |   | JP7A                                                      |

Figure 15: EMC2-DP-V2 Set (1.8V) before applying power to the board

**USB terminal** for the ARM processor is connected over the Sundance expansion board. It uses 115 200 bps.

**Serial RS232 terminal** for the MicroBlaze is connected to the RS1 2, RS1 3 and RS1 4 pins of the RS1 header. See Figure 16 and Figure 18. It connects to a serial port on a PC, using a straight-through cable with a DCE 9 pin connector, as a 3-wire DTE serial port (with one transmitter for transmit-data signals, one receiver for receive-data signals, and a signal ground connection). It uses 115 200 bps.

| TE0715-30 FPGA pin | Description                                 | Name | RS1 connector |
|--------------------|---------------------------------------------|------|---------------|
|                    |                                             | 3.3V | RS1 1         |
| J7                 | From FPGA to PC (converted to RS232 levels) | TX1  | RS1 2         |
| M8                 | From PC to FPGA (converted from RS232)      | RX1  | RS1 3         |
|                    |                                             | GND  | RS1 4         |
| J6                 | From FPGA to PC (converted to RS232 levels) | TX2  | RS1 5         |
| D1                 | From PC to FPGA (converted from RS232)      | RX2  | RS1 6         |
|                    |                                             | GND  | RS1 7         |
| M7                 | From FPGA to PC (converted to RS232 levels) | TX3  | RS1 8         |
| C1                 | From PC to FPGA (converted from RS232)      | RX3  | RS1 9         |
Figure 16: RS1 header for RS232 cable connection to the EMC2-DP-V2
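Both terminals use 115 200 bps, and the test procedure in this note uses 8 data bits, no parity, 1 stop bit and no flow control. This note uses PuTTY on a PC. For readers who prefer to script the terminal on a POSIX host (for example Linux), the sketch below fills a `termios` structure with the same settings. The function name `fill_115200_8n1` is hypothetical, and the constant `B115200` is a common extension rather than part of base POSIX.

```c
#include <assert.h>
#include <string.h>
#include <termios.h>

/* Fill *t for 115200 baud, 8 data bits, no parity, 1 stop bit,
 * no flow control, raw (non-canonical) mode. Returns 0 on success. */
int fill_115200_8n1(struct termios *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof *t);

    /* CS8: 8 data bits. PARENB and CSTOPB stay cleared, which means
     * no parity and 1 stop bit. CLOCAL ignores modem control lines
     * (3-wire cable), CREAD enables the receiver. */
    t->c_cflag = CS8 | CLOCAL | CREAD;

    t->c_iflag = 0;        /* no XON/XOFF, no input translation */
    t->c_oflag = 0;        /* no output post-processing         */
    t->c_lflag = 0;        /* no echo, no line editing          */
    t->c_cc[VMIN]  = 1;    /* read() returns after 1 byte       */
    t->c_cc[VTIME] = 0;    /* no inter-byte timeout             */

    if (cfsetispeed(t, B115200) != 0)
        return -1;
    if (cfsetospeed(t, B115200) != 0)
        return -1;
    return 0;
}
```

The filled structure would then be applied to an opened serial device with `tcsetattr(fd, TCSANOW, &t)`. The device name depends on the host and on the USB or RS232 adapter in use.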






Figure 17: Location and setting of JP7 and JP8 (1.8V) on EMC2-DP-V2 carrier board.



Figure 18: RS232 3-wire cable connected to the RS1 header of the EMC2-DP-V2 carrier board.

http://zs.utia.cas.cz

## 2. Installation of the evaluation package

#### 2.1 Import of SW projects in Xilinx SDK 2015.4

Unzip the evaluation package to a directory of your choice. The directory C:\VM\_07 is used in this application note:

C:\VM\_07\s30i1hm4\_V54\_IMPORT

Create an empty directory for the Xilinx SDK workspace:

C:\VM\_07\s30i1hm4

Start Xilinx SDK 2015.4 and select the directory for the SDK 2015.4 workspace. See Figure 19. Select C:\VM\_07\s30i1hm4



Figure 19: Select the SDK Workspace

HW and SW projects can be imported into SDK now. Select:

File -> Import -> General -> Existing Projects into Workspace

Click on the Next button. See Figure 20.





Figure 20: Import Existing Projects into Workspace

Type the directory with projects to be imported. See Figure 21.

C:\VM\_07\s30i1hm4\_V54\_IMPORT

Set the "Copy projects into workspace" check box. Click on the Finish button. See Figure 21.

The compilation starts automatically. This first compilation of all SDK SW projects can take several minutes. It should finish without errors.





Figure 21: Select "Copy projects into workspace" and finish the import of all projects.




Figure 22: All projects are compiled in debug mode.

SDK 2015.4 compiles SW of all imported demos in debug mode.



#### 2.2 Test demos

To test demos follow these steps:

- 1. Connect HDMI (or DVI) source by HDMI cable to the HDMI IN connector of the AES-FMC-HDMI-CAM-G.
- 2. Connect HDMI (or DVI) monitor by HDMI cable to the HDMI OUT on the AES-FMC-HDMI-CAM-G board.
- 3. Switch the monitor ON.
- 4. Connect the carrier board by USB-to-microUSB cable to PC to support serial terminal for ARM processor.
- 5. Connect the RS232 serial interface to the carrier board as indicated in Figure 23. Connect the RS232 cable to COM serial port of your PC as terminal for the MicroBlaze processor.
- 6. Connect the Xilinx Platform Cable USB II JTAG for download of bitstreams and debug.
- 7. Connect power supply.



Figure 23: EMC2-DP-V2 FMC side -USB (ARM terminal), RS232 (MicroBlaze terminal), HDMI I/O




#### Start test by this sequence of steps:

- 1. Switch ON the board to generate the power-on reset of the board.
- 2. Download the bitstream via the Xilinx Platform Cable USB II JTAG.
- 3. In the SDK Debugger,
  - a. delete the debug version of the ARM code,
  - b. select the release version of the ARM code and download it to the DDR3 via the Xilinx Platform Cable USB II JTAG.
- 4. Start the USB serial terminal application on the PC to be used by the ARM.
  - a. Use a serial terminal client (PuTTY or similar) (USB emulated).
  - b. Set speed: 115200 baud; Data bits: 8; Stop bits: 1; Parity: None; Flow control: None.
- 5. Start the RS232 serial terminal on the PC to be used by the MicroBlaze.
  - a. Use the RS232 serial terminal link and a PC client like PuTTY or similar.
  - b. Set speed: 115200 baud; Data bits: 8; Stop bits: 1; Parity: None; Flow control: None.
- 6. Start the ARM code from the debugger. It starts the MicroBlaze looping at the first instruction.
- 7. In the SDK Debugger,
  - a. delete the debug version of the MicroBlaze code,
  - b. select the release version of the MicroBlaze code,
  - c. deselect all default options (as indicated in Figure 28) and
  - d. download the release version of the MicroBlaze code to the DDR3 via the Xilinx Platform Cable USB II JTAG.
  - e. Start the MicroBlaze code from the debugger.
- 8. Both processors execute the release version of the code in parallel with the HW accelerated video processing.

After each test, close both serial terminal programs on the PC before doing the power OFF/ON of the board needed for the reset.

#### Example of the debug session:

Download bitstream to the board. Demo sh03\_rows\_resize\_25\_to\_100 will be used as an example.

The bitstream.bit for demo sh03 is located in the directory:

C:\VM\_07\s30i1hm4\sh03\_hw\_platform\_0

Select Program to download the bitstream to the PL part of Zynq via the Xilinx Platform Cable USB II JTAG.







Figure 24: Download bitstream to the PL part of Zynq.




Figure 25: Select demo application for debug.



Figure 26: Demo is booted to the ARM and the debugger is waiting on the first executable line.



Figure 27: ARM is waiting on HW Mutex for the MicroBlaze start.

The debug perspective is opened and the **Debug\sh03\_rows\_resize\_25\_to\_100.elf** program can be debugged or started on the ARM core. See Figure 26. Click the Resume button (F8) in the debugger to start the program. The ARM application starts to run with output to the terminal window. The timer is started with 3 ns resolution. The CPU0: prefix on the terminal indicates output from Core\_0 of the dual core Cortex A9 of the Zynq. The ARM processor is running and waiting in a polling loop for the handshake with the MicroBlaze. See Figure 27.

The ARM application **Debug\sh03\_rows\_resize\_25\_to\_100.elf** has prepared the initial waiting loop code for the MicroBlaze processor at the address 0x30000000 in the DDR3. The MicroBlaze has been released from reset by the ARM application. MicroBlaze is running the initial loop code at the address 0x30000000 now.

The MicroBlaze application **Debug\sh03\_edkdsp\_fp12\_1x8\_all.elf** will be loaded to the DDR3 memory in next steps.

The SDK debugger has to connect to the running MicroBlaze via JTAG in a second connection. The running MicroBlaze processor will be stopped under JTAG control. The MicroBlaze executable **Debug\sh03\_edkdsp\_fp12\_1x8\_all.elf** will be downloaded to the DDR3 and the MicroBlaze will be started again via JTAG in the second debug instance created in the same SDK 2015.4 interface.



Figure 28: Select the MicroBlaze application (with the EdkDSP accelerator code) for debug.



The program for the MicroBlaze processor is downloaded via the second JTAG connection while the ARM processor is already running. To keep the ARM running, change the default options of the MicroBlaze debug connection as follows:

- Unselect "Run ps7\_init"
- Unselect "Run ps7\_post\_config"
- Select No reset

Click on "Apply" button. See Figure 28.



Figure 29: MicroBlaze application is loaded and debugger stops on the first instruction.

Click on "Debug" to download the **Debug\sh03\_edkdsp\_fp12\_1x8\_all.elf** to the DDR3 as program for the MicroBlaze processor.



The debugger will download this code via the second JTAG connection. The debugger will set a breakpoint at the first executable instruction of the debugged MicroBlaze code and stop the MicroBlaze at this initial breakpoint. See Figure 29. The debugging is in the following initial state:

- The ARM processor thread runs. It is testing the state of the HW Mutex.
- MicroBlaze thread is suspended at the initial, first instruction breakpoint hit. See Figure 29.

Click on the |> icon to start the execution of the MicroBlaze. The SW handshake between the ARM processor and the MicroBlaze processor (supported by the HW Mutex IP) is completed at this point. Both processors start to run. The ARM processor initiates the Full HD video processing IP cores. It controls in SW the status of the VDMA units and it also sets the correct pointers to the active video frame buffers. Video processing is performed by the HW accelerators. Data are moved from the video frame buffers to the HW accelerators and back to the output video frame buffers by the HW data mover IPs. All HW IPs are configured by the ARM SW via the ARM AXI-Lite interface.

- The input data movers act as the HW masters controlling the DMA engines moving data from the DDR3 to the chain(s) of video processing IP cores.
- The output data movers act as the HW masters controlling the DMA engines moving data from the output of the chain(s) of video processing IP cores to the DDR3 output video frame buffers.

```
Lines: 339 ARM cycles: 6879728, FPS: 60.073547
Lines: 338 ARM cycles: 6859644, FPS: 60.077694
Lines: 337 ARM cycles: 6839394, FPS: 60.073540
Lines: 336 ARM cycles: 6819270, FPS: 60.073494
Lines: 335 ARM cycles: 6799010, FPS: 60.072952
Lines: 334 ARM cycles: 6778800, FPS: 60.072681
Lines: 333 ARM cycles: 6758642, FPS: 60.072510
```

Figure 30: ARM is running. It indicates the number of frames per second.

The MicroBlaze processor executes in parallel with the ARM CPU its program from the DDR3. It sets up the initial firmware and data for the (8xSIMD) EdkDSP floating point accelerator(s) while these accelerators are in the initial reset stage.

After the initialization stage is passed, the MicroBlaze program runs tests of all basic floating point operations which are supported by the (8xSIMD) EdkDSP accelerator(s) and verifies whether the (8xSIMD) EdkDSP results are bit-exact identical with the reference results computed in SW by the MicroBlaze processor. See Figure 31.
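A bit-exact check is stricter than a comparison with `==`, because it compares the stored 32-bit patterns rather than the numeric values. The sketch below shows one common way to write such a check in C. It is not taken from the demo sources; the function name `first_bit_mismatch` is hypothetical, and the code assumes a 32-bit `float`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compare two float vectors bit by bit.
 * Returns the index of the first element whose 32-bit pattern differs,
 * or n when all n elements are bit-exact identical. */
size_t first_bit_mismatch(const float *a, const float *b, size_t n)
{
    size_t i;
    assert(a != NULL && b != NULL);
    assert(sizeof(float) == sizeof(uint32_t));   /* assumption of this sketch */
    for (i = 0; i < n; i++) {
        uint32_t ua, ub;
        memcpy(&ua, &a[i], sizeof ua);   /* well-defined way to read the bits */
        memcpy(&ub, &b[i], sizeof ub);
        if (ua != ub)
            return i;
    }
    return n;
}
```

As an example of the difference, the values `0.0f` and `-0.0f` compare equal with `==` but have different bit patterns, so this check reports them as a mismatch.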

In the next stage, the MicroBlaze processor reprograms the (8xSIMD) EdkDSP accelerators to perform an FIR filter program. The (8xSIMD) EdkDSP accelerator is processing the predefined input acoustic data (in floating point) delivered by the MicroBlaze processor from the DDR3 memory. As a next stage of the demo, the second firmware changes the (8xSIMD) EdkDSP schedule to perform an LMS adaptive filter, working again on the predefined I/O acoustic data (in floating point). Data are delivered by the MicroBlaze from/to the DDR3 memory.

The demo application – the acoustic data processing with (2000 coefficient FIR filter) and (2000 coefficient LMS identification of filter coefficients) - is computed in single precision floating point in the first (8xSIMD) EdkDSP accelerator with support by the MicroBlaze processor.
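For readers unfamiliar with the two algorithms, the following is a minimal scalar reference of an FIR output and of one LMS update step in single precision. It is a textbook formulation written for this explanation, not the demo code. The function names and the step size parameter `mu` are assumptions, and the sequential summation order used here is not necessarily the order used by the 8xSIMD data path.

```c
#include <assert.h>
#include <stddef.h>

/* FIR output for one sample: y = sum over k of w[k] * x[k],
 * where x[0] is the newest input sample. Costs 2*n FLOP. */
float fir_output(const float *w, const float *x, size_t n)
{
    float acc = 0.0f;
    size_t k;
    assert(w != NULL && x != NULL);
    for (k = 0; k < n; k++)
        acc += w[k] * x[k];
    return acc;
}

/* One LMS identification step:
 *   e = d - w.x            (error against the desired sample d)
 *   w[k] += mu * e * x[k]  (coefficient update)
 * Returns the error e. Costs 4*n + 2 FLOP in this formulation. */
float lms_step(float *w, const float *x, size_t n, float d, float mu)
{
    float e = d - fir_output(w, x, n);
    float g = mu * e;
    size_t k;
    for (k = 0; k < n; k++)
        w[k] += g * x[k];
    return e;
}
```

In the demo, `n` would be 2000 for both filters. Because floating point addition is not associative, a reference intended for bit-exact comparison has to use the same order of operations as the implementation it is compared with.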

Finally, the same demo application (2000 coefficient FIR filter) and (2000 coefficient LMS identification of filter coefficients) is also computed (in single precision floating point) on the MicroBlaze with support of the internal HW floating point unit to verify that the (8xSIMD) EdkDSP accelerator results are bit-exact identical to the MicroBlaze results.

The performance of the combination of the MicroBlaze processor with the (8xSIMD) EdkDSP accelerator is measured by HW timer IP. The timer is instantiated as a MicroBlaze AXI-Lite IP core. See Figure 31.
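The MFLOPs values printed on the terminal follow from an operation count and a measured time. The sketch below shows the arithmetic only. The function name and the timer frequency in the example are hypothetical, and the way the demo counts floating point operations is not described in this note.

```c
#include <assert.h>

/* Convert an operation count and a timer reading into MFLOP/s.
 * flop_count: number of floating point operations executed
 * ticks:      elapsed timer ticks
 * timer_hz:   timer frequency in Hz */
double mflops_from_ticks(double flop_count, unsigned long long ticks,
                         double timer_hz)
{
    double seconds;
    assert(ticks > 0 && timer_hz > 0.0);
    seconds = (double)ticks / timer_hz;
    return flop_count / seconds / 1.0e6;
}
```

For example, 4 000 000 operations measured as 400 000 ticks of a hypothetical 100 MHz timer take 4 ms and correspond to 1000 MFLOP/s.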

```
COM1 - PuTTY
MB0 : (EdkDSP 8xSIMD) Capabilities Worker1 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities Worker2 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities Worker3 = 13ffff
MB0 : (HW FP unit   ) Far-end signal ...
MB0 : (EdkDSP 8xSIMD) FIR room response ... 1416 MFLOPs
MB0 : (HW FP unit   ) Add near-end signal ...
MB0 : (EdkDSP 8xSIMD) LMS Identification ... 914 MFLOPs
MB0 : (HW FP unit   ) LMS Identification ... 4 MFLOPs
MB0 : (EdkDSP 8xSIMD) OK
MB0 : (EdkDSP 8xSIMD) Write firmware ...
MB0 : (EdkDSP 8xSIMD) Capabilities1 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities2 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities3 = 13ffff
MB0 : (EdkDSP 8xSIMD) VZ2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VB2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VZ2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VA2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VADD 'worker1'
MB0 : (EdkDSP 8xSIMD) VADD BZ2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VADD AZ2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VSUB 'worker1'
MB0 : (EdkDSP 8xSIMD) VSUB BZ2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VSUB AZ2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VMULT 'worker1'
MB0 : (EdkDSP 8xSIMD) VMULT BZ2A 'worker1' . OK
MB0 : (EdkDSP 8xSIMD) VMULT AZ2B 'worker1' . OK
MB0 : (EdkDSP 8xSIMD) VPROD 'worker1'
MB0 : (EdkDSP 8xSIMD) VMAC 'worker1'
MB0 : (EdkDSP 8xSIMD) VMSUBAC 'worker1' .... OK
MB0 : (EdkDSP 8xSIMD) VPROD S8 'worker1' ... OK
MB0 : (EdkDSP 8xSIMD) VDIV 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) Write firmware ...
MB0 : (EdkDSP 8xSIMD) Capabilities Worker1 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities Worker2 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities Worker3 = 13ffff
MB0 : (HW FP unit   ) Far-end signal ...
MB0 : (EdkDSP 8xSIMD) FIR room response ... 1416 MFLOPs
MB0 : (HW FP unit   ) Add near-end signal ...
MB0 : (EdkDSP 8xSIMD) LMS Identification ...
```

Figure 31: MicroBlaze is running. Debug version. It indicates MFLOPs.



```
COM1 - PuTTY
MB0 : (EdkDSP 8xSIMD) VDIV 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) Write firmware ...
MB0 : (EdkDSP 8xSIMD) Capabilities Worker1 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities Worker2 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities Worker3 = 13ffff
MB0 : (HW FP unit   ) Far-end signal ...
MB0 : (EdkDSP 8xSIMD) FIR room response ... 1419 MFLOPs
MB0 : (HW FP unit   ) Add near-end signal ...
MB0 : (EdkDSP 8xSIMD) LMS Identification ... 914 MFLOPs
MB0 : (HW FP unit   ) LMS Identification ... 12 MFLOPs
MB0 : (EdkDSP 8xSIMD) OK
MB0 : (EdkDSP 8xSIMD) Write firmware ...
MB0 : (EdkDSP 8xSIMD) Capabilities1 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities2 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities3 = 13ffff
MB0 : (EdkDSP 8xSIMD) VZ2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VB2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VZ2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VA2B 'worker1'
MB0 : (EdkDSP 8xSIMD) VADD 'worker1'
MB0 : (EdkDSP 8xSIMD) VADD BZ2A 'worker1'
MB0 : (EdkDSP 8xSIMD) VADD AZ2B 'worker1' .. OK
MB0 : (EdkDSP 8xSIMD) VSUB 'worker1'
MB0 : (EdkDSP 8xSIMD) VSUB BZ2A 'worker1' .. OK
MB0 : (EdkDSP 8xSIMD) VSUB AZ2B 'worker1' .. OK
MB0 : (EdkDSP 8xSIMD) VMULT 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) VMULT BZ2A 'worker1' . OK
MB0 : (EdkDSP 8xSIMD) VMULT AZ2B 'worker1' . OK
MB0 : (EdkDSP 8xSIMD) VPROD 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) VMAC 'worker1'
MB0 : (EdkDSP 8xSIMD) VMSUBAC 'worker1' .... OK
MB0 : (EdkDSP 8xSIMD) VPROD S8 'worker1' ... OK
MB0 : (EdkDSP 8xSIMD) VDIV 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) Write firmware ...
MB0 : (EdkDSP 8xSIMD) Capabilities Worker1 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities Worker2 = 13ffff
MB0 : (EdkDSP 8xSIMD) Capabilities Worker3 = 13ffff
MB0 : (HW FP unit   ) Far-end signal ...
MB0 : (EdkDSP 8xSIMD) FIR room response ... 1417 MFLOPs
MB0 : (HW FP unit   ) Add near-end signal ...
MB0 : (EdkDSP 8xSIMD) LMS Identification ... 914 MFLOPs
```

Figure 32: MicroBlaze is running. Release version. It indicates MFLOPs.

The (8xSIMD) EdkDSP accelerators are named worker1 ... worker3. All 3 workers have identical capabilities.

Each of the two parallel running processors (ARM and MicroBlaze) can be stopped/resumed/terminated from the debugger.



Finally, terminate the debug session in the SDK 2015.4 debugger by this sequence of commands:

- 1. Stop MicroBlaze processor.
- 2. Stop ARM processor core 0.
- 3. Terminate MicroBlaze processor.
- 4. Terminate ARM processor.
- 5. Close the debug perspective.
- 6. Close both terminal windows.
- 7. Switch OFF the power of the EMC2-DP-V2 board.

All evaluation demos can be also compiled into release versions with optimisation set to -O2 or -O3. These optimisations can be set independently for the ARM and for the MicroBlaze processor in the SDK 2015.4 project. Set both projects to the release target. Delete the debug versions of projects. This will trigger the compilation of release versions.

Start test by repeating the initial sequence of steps:

- 1. Switch ON the board to generate the power-on reset of the board.
- 2. Download the bitstream via the Xilinx Platform Cable USB II JTAG.
- 3. In the SDK Debugger,
  - a. delete the debug version of the ARM code,
  - b. select the release version of the ARM code and download it to the DDR3 via the Xilinx Platform Cable USB II JTAG.
- 4. Start the USB serial terminal application on the PC to be used by the ARM.
- 5. Start the RS232 serial terminal on the PC to be used by the MicroBlaze.
- 6. Start the ARM code from the debugger. It starts the MicroBlaze looping at the first instruction.
- 7. In the SDK Debugger,
  - a. delete the debug version of the MicroBlaze code,
  - b. select the release version of the MicroBlaze code,
  - c. deselect all default options (as indicated in Figure 28) and
  - d. download the release version of the MicroBlaze code to the DDR3 via the Xilinx Platform Cable USB II JTAG.
  - e. Start the MicroBlaze code from the debugger.
- 8. Both processors execute the release version of the code in parallel with the HW accelerated video processing.

See MicroBlaze terminal in Figure 32. It is an example of the output from the release version of the same demo. After execution of the release version, terminate the release session as described above for the debug session.

- Demos like sh01\_rows\_fixed\_100 work on complete video frame (with single HW accelerator data path).
- Demos like sh01\_rows\_resize\_25\_to\_100 work with identical bitstream and HW video-processing accelerators, but the ARM SW is setting dynamically the number of lines to be processed for each new frame. In the demo, ARM scales the number of horizontal lines from ¼ of the frame to the complete frame. The HW data movers are instructed about the number of lines to be processed. Demo SW running on ARM is writing this information to the AXI-lite configuration registers of the data mover IP cores before start of processing of each frame.
- Please note that the part of the frame which is not processed is propagated to the HDMI output via the cyclic structure of the 8 video frame buffers. See Figure 33 for an example of three parallel variable data paths.
- Demos sh02\_rows\_fixed\_100 and sh02\_rows\_resize\_25\_to\_100 work with 2 data paths.
- Demos sh03\_rows\_fixed\_100 and sh03\_rows\_resize\_25\_to\_100 work with 3 data paths. See Figure 33.
- Demo md01\_rows\_fixed\_100 works with one HW video processing chain with fixed set of processed lines.
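The per-frame change of the number of processed lines in the resize demos can be pictured as a sweep between one quarter of the lines and all lines. The sketch below is a hypothetical model of such a sweep and is not the demo code. The type `line_sweep_t`, the function names, the step of one line per frame and the turn-around at both ends are assumptions, and the write to the AXI-Lite registers of the data movers is not modelled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int lines;       /* number of lines to process in the next frame */
    int step;        /* +1 or -1 lines per frame (assumed step size)  */
    int min_lines;   /* one quarter of the full line count            */
    int max_lines;   /* full line count                               */
} line_sweep_t;

/* Start at the full line count and sweep downwards first. */
void sweep_init(line_sweep_t *s, int full_lines)
{
    assert(s != NULL && full_lines >= 4);
    s->max_lines = full_lines;
    s->min_lines = full_lines / 4;
    s->lines     = full_lines;
    s->step      = -1;
}

/* Return the line count for the current frame and advance the sweep.
 * The direction reverses when the next value would leave the range. */
int sweep_next(line_sweep_t *s)
{
    int current, next;
    assert(s != NULL);
    current = s->lines;
    next = s->lines + s->step;
    if (next < s->min_lines || next > s->max_lines)
        s->step = -s->step;
    s->lines += s->step;
    return current;
}
```

In an application loop, the value returned by `sweep_next()` would be written to the data mover configuration before the processing of each frame is started.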






Figure 33: HW accelerated motion detection video processing (md01) performed in parallel with the EdkDSP accelerators.

#### 2.3 Synchronisation of ARM C/C++ code with video processing HW accelerators

This section describes the synchronisation of ARM C code with parallel video processing HW accelerators. Two programming models are described:

- User defined synchronisation with parallel HW data paths.
- Internal synchronisation with parallel HW data paths.



#### User defined synchronisation with parallel HW data paths (barrier)

Consider the **sh03\_rows\_resize\_25\_to\_100** project as an example. Three HW data paths perform edge detection in parallel on 3 separate areas of a DDR3 video frame. The ARM C code calls the function img\_process(). See Figure 34.

```
// C:\VM_07\s30i1hm4\sh03_rows_resize_25_to_100\sobel\img_filters.c
#include <stdio.h>
#include "frame_size.h"
#include "hw_sobel.h"

void img_process(unsigned short *fb_in, unsigned short *fb_out) {
#pragma SDS async(3)
    _p0_sobel_filter_htile3_0(fb_in + 2*(NUMTILEROWS-1)*NUMPADCOLS,
                fb_out + 2*(NUMTILEROWS-1)*NUMPADCOLS, NUMTILEROWS);
#pragma SDS async(2)
    _p0_sobel_filter_htile2_0(fb_in + (NUMTILEROWS-1)*NUMPADCOLS,
                fb_out + (NUMTILEROWS-1)*NUMPADCOLS, (NUMTILEROWS+1));
#pragma SDS async(1)
    _p0_sobel_filter_htile1_0(fb_in, fb_out, (NUMTILEROWS+1));
    // Parallel ARM code here
    sds_wait(3);   // blocks until HW path 3 is done
    sds_wait(2);   // blocks until HW path 2 is done
    sds_wait(1);   // blocks until HW path 1 is done
}
```

Figure 34: Listing of ARM C function using the user defined synchronisation with parallel HW data paths.

The three functions:

```
_p0_sobel_filter_htile3_0() // Not blocking, starts HW path 3
_p0_sobel_filter_htile2_0() // Not blocking, starts HW path 2
_p0_sobel_filter_htile1_0() // Not blocking, starts HW path 1
```

correspond to the three HW video acceleration data paths. These functions are independent. Each function only starts its HW data path, and none of the three functions blocks. All three functions have been defined in the original SDSoC 2015.4 project with the #pragma SDS async and exported in the libsh03.a static library. The synchronisation point (similar to a barrier in the case of SW threads) is implemented separately by three calls to the functions sds\_wait(3); sds\_wait(2); sds\_wait(1);. These functions are blocking, and each of them returns when the corresponding HW accelerated data path is done.
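The pointer offsets and row counts passed to the three functions in Figure 34 define how the frame is split into tiles. The sketch below only restates that arithmetic in a helper, so that the tile layout can be inspected on a host. The type `tile_t` and the function `sobel_tiles` are hypothetical, and the values of NUMTILEROWS and NUMPADCOLS, which come from frame\_size.h, are passed in as parameters.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t offset;   /* offset into fb_in and fb_out, in pixels */
    int    rows;     /* number of rows handed to the HW path    */
} tile_t;

/* Tile layout as used by the three calls in Figure 34.
 * t[0] is HW path 1, t[1] is HW path 2, t[2] is HW path 3. */
void sobel_tiles(int numtilerows, int numpadcols, tile_t t[3])
{
    size_t stride;
    assert(numtilerows > 1 && numpadcols > 0 && t != NULL);
    stride = (size_t)(numtilerows - 1) * (size_t)numpadcols;

    t[0].offset = 0;            t[0].rows = numtilerows + 1;
    t[1].offset = stride;       t[1].rows = numtilerows + 1;
    t[2].offset = 2 * stride;   t[2].rows = numtilerows;
}
```

With this layout, neighbouring tiles overlap by two rows. Such an overlap is what a filter with a 3x3 neighbourhood typically needs at tile borders, although this note does not state the reason explicitly.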

The ARM processor can execute user C code in parallel with the started HW accelerated data paths. This parallel processing is implemented in a single SW thread.

The video processing speed is unaffected if the time needed for the ARM code segment is shorter than the time needed for the parallel, HW controlled data paths.






### Internal synchronisation with parallel HW data paths

Figure 35 presents the interface to the HW pipeline of accelerators with a fixed data path, as used in the md01 demo. The accelerators in the HW pipeline communicate directly with each other, and the final synchronisation takes place in the function \_p0\_ext\_0().

The sequence of function calls in Figure 35 is fixed. It cannot be changed. It is related to the md01 HW pipeline.

```
// C:\VM_07\s30i1hm4\md01_rows_fixed_100\motion_detect\img_filters.c
#include <stdio.h>
#include "frame_size.h"
#include "hw_motion_detect.h"

unsigned short yc_data_prev[NUMROWS*NUMCOLS], yc_data_in[NUMROWS*NUMCOLS];
unsigned short yc_out_tmp1[NUMROWS*NUMCOLS], yc_out_tmp2[NUMROWS*NUMCOLS];
unsigned short yc_out_tmp3[NUMROWS*NUMCOLS], yc_out_tmp4[NUMROWS*NUMCOLS];
unsigned char sobel_curr[NUMROWS*NUMCOLS], sobel_prev[NUMROWS*NUMCOLS];
unsigned char motion_image_tmp1[NUMROWS*NUMCOLS];
unsigned char motion_image_tmp2[NUMROWS*NUMCOLS];

void img_process(unsigned short *rgb_data_prev,
                 unsigned short *rgb_data_in,
                 unsigned short *rgb_data_out,
                 int param0, int param1, int param2) {
    unsigned char pass_through;
    unsigned char threshold = 100;
    pass_through = 0;
    _p0_pad_1(rgb_data_prev, yc_data_prev);
    _p0_pad_0(rgb_data_in, yc_data_in);
    _p0_sobel_filter_pass_0(yc_data_in, sobel_curr, yc_out_tmp1);
    _p0_sobel_filter_0(yc_data_prev, sobel_prev);
    _p0_diff_image_0(sobel_curr, sobel_prev, yc_out_tmp1, yc_out_tmp2,
                     motion_image_tmp1);
    _p0_median_char_filter_pass_0(threshold, motion_image_tmp1,
                     yc_out_tmp2, motion_image_tmp2, yc_out_tmp3);
    _p0_combo_image_0(pass_through, motion_image_tmp2, yc_out_tmp3,
                      yc_out_tmp4);
    _p0_ext_0(yc_out_tmp4, rgb_data_out);   // final synchronisation point
}
```

Figure 35: Listing of ARM C function with fixed data width interfacing HW pipeline of accelerators.



### 2.4 EdkDSP C compiler API

The PicoBlaze6 controller acts as a programmable finite state machine in the (8xSIMD) EdkDSP accelerator. It sets the sequences of wide instructions for the 8xSIMD floating point data path of the EdkDSP accelerator.

The EdkDSP accelerators are connected to a Xilinx MicroBlaze processor. The MicroBlaze processor implements the desired sequences of operations, composed of:

- accelerator firmware programming,
- starting and synchronization of accelerators,
- data communication.

These operations are supported by the Worker Abstraction Layer API. The API functions are summarized in Table 4.

Table 4: API for MicroBlaze C code

| Function                                        | Description                                                                                                                                      |
|-------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| **Init/Done functions**                         |                                                                                                                                                  |
| wal_init_worker                                 | Initiate and claim worker in an application                                                                                                      |
| wal_done_worker                                 | Cleanup and release data structures allocated for the worker                                                                                     |
| **Basic control functions**                     |                                                                                                                                                  |
| wal_reset_worker                                | Send hard reset to the worker. Set control part of the worker to the default state                                                               |
| wal_start_operation                             | Select and run preloaded firmware in the worker                                                                                                  |
| wal_end_operation                               | Send request to stop worker operation. It is followed by request to worker reset                                                                 |
| wal_is_busy                                     | Test if the worker is currently busy (it is a non-blocking operation)                                                                            |
| wal_mb2pb                                       | Set control word to the worker                                                                                                                   |
| wal_pb2mb                                       | Read status word of the worker                                                                                                                   |
| **Functions for working with control memories** |                                                                                                                                                  |
| wal_mb2cmem                                     | Copy block of data from MicroBlaze user defined C array to the control memory shared by worker with MicroBlaze (worker firmware, 2 memories)     |
| wal_cmem2mb                                     | Copy block of data from control memory of worker shared with MicroBlaze to MicroBlaze user defined C array                                       |
| **Functions for working with data memories**    |                                                                                                                                                  |
| wal_mb2dmem                                     | Copy block of data from MicroBlaze user defined C array to selected data memory of the worker shared with the MicroBlaze (for 8xSIMD EdkDSP, 24 memories) |
| wal_dmem2mb                                     | Copy block of data from the selected data memory of the worker shared with the MicroBlaze (for 8xSIMD EdkDSP, 24 memories)                       |
| **Common support functions**                    |                                                                                                                                                  |
| wal_set_firmware                                | Copy worker firmware to selected position                                                                                                        |
| wal_get_id                                      | Read worker ID                                                                                                                                   |
| wal_get_capabilities                            | Read worker capabilities                                                                                                                         |
| wal_get_license                                 | Read worker license                                                                                                                              |
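The order in which these API groups are typically combined (claim the worker, load firmware, copy input data, start, poll, read results, release) can be illustrated with a host-side mock. The sketch below does not use the real WAL functions, whose signatures are not given in this application note. Every `mock_*` name, the data structure and the vector-add "firmware" are hypothetical stand-ins that only mirror the roles listed in Table 4.

```c
#include <assert.h>
#include <string.h>

#define MOCK_VEC_LEN 8

/* Hypothetical stand-in for one worker; NOT the real WAL data structure. */
typedef struct {
    int   claimed;           /* set by init, cleared by done              */
    int   firmware_loaded;   /* set once "firmware" has been copied       */
    int   busy_polls;        /* polls left until the mock operation ends  */
    float a[MOCK_VEC_LEN];   /* mock data memory A (input)                */
    float b[MOCK_VEC_LEN];   /* mock data memory B (input)                */
    float z[MOCK_VEC_LEN];   /* mock data memory Z (result)               */
} mock_worker_t;

/* Plays the role of wal_init_worker: claim the worker. */
void mock_init_worker(mock_worker_t *w) { memset(w, 0, sizeof *w); w->claimed = 1; }

/* Plays the role of wal_mb2cmem / wal_set_firmware: load firmware. */
void mock_load_firmware(mock_worker_t *w) { assert(w->claimed); w->firmware_loaded = 1; }

/* Plays the role of wal_mb2dmem: copy input vectors to data memories. */
void mock_write_inputs(mock_worker_t *w, const float *a, const float *b)
{
    memcpy(w->a, a, sizeof w->a);
    memcpy(w->b, b, sizeof w->b);
}

/* Plays the role of wal_start_operation: run the preloaded firmware. */
void mock_start_operation(mock_worker_t *w)
{
    assert(w->firmware_loaded);
    for (int i = 0; i < MOCK_VEC_LEN; i++) {
        w->z[i] = w->a[i] + w->b[i];   /* mock firmware: vector add */
    }
    w->busy_polls = 3;                 /* pretend the worker needs a while */
}

/* Plays the role of wal_is_busy: non-blocking busy test. */
int mock_is_busy(mock_worker_t *w)
{
    if (w->busy_polls > 0) { w->busy_polls--; return 1; }
    return 0;
}

/* Plays the role of wal_dmem2mb: copy the result back. */
void mock_read_result(const mock_worker_t *w, float *z) { memcpy(z, w->z, sizeof w->z); }

/* Plays the role of wal_done_worker: release the worker. */
void mock_done_worker(mock_worker_t *w) { w->claimed = 0; }

/* Complete sequence; returns how many times the worker reported "busy". */
int mock_run_vector_add(const float *a, const float *b, float *z)
{
    mock_worker_t w;
    int polls = 0;
    mock_init_worker(&w);
    mock_load_firmware(&w);
    mock_write_inputs(&w, a, b);
    mock_start_operation(&w);
    while (mock_is_busy(&w)) {
        polls++;                       /* other MicroBlaze work could go here */
    }
    mock_read_result(&w, z);
    mock_done_worker(&w);
    return polls;
}
```

Because the busy test is non-blocking, the polling loop is the natural place for other MicroBlaze work while the accelerator computes.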





The last layer is the basic I/O library prepared for the (8xSIMD) EdkDSP accelerator for communication of its PicoBlaze6 controller with the MicroBlaze processor.

Each EdkDSP I/O API function has been optimised in assembler to provide low footprint and maximum performance at this low-level hardware layer. The EdkDSP I/O library functions are listed in Table 5.

Table 5: EdkDSP accelerator I/O API functions used by the PicoBlaze6 controller firmware

| Function          | Description                                                                        |
|-------------------|------------------------------------------------------------------------------------|
| mb2pb_read_data   | Read value from MicroBlaze (blocking, includes handshake with MicroBlaze)          |
| pb2mb_write       | Write data value from PicoBlaze to MicroBlaze (blocking, includes handshake with MicroBlaze SW) |
| pb2mb_eoc         | Write data value from PicoBlaze to MicroBlaze and indicate the end of string flag (blocking, includes handshake with MicroBlaze) |
| pb2mb_req_reset   | Write data value from PicoBlaze to MicroBlaze and indicate a request to reset PicoBlaze (blocking, includes handshake with MicroBlaze) |
| pb2mb_reset       | Activate PicoBlaze reset from PicoBlaze program with MicroBlaze support (blocking, includes handshake with MicroBlaze) |
| led2pb            | Read PicoBlaze LED port                                                            |
| btn2pb            | Read PicoBlaze BTN port                                                            |
| hex_h             | Write hexadecimal ASCII representation of the high 4 bits of the 8-bit input argument from PicoBlaze to MicroBlaze (blocking, includes handshake with MicroBlaze) |
| hex_l             | Write hexadecimal ASCII representation of the low 4 bits of the 8-bit input argument from PicoBlaze to MicroBlaze (blocking, includes handshake with MicroBlaze) |
| pb2dfu_set        | Write 8-bit data to the PicoBlaze I/O port mem                                     |
| pb2dfu_wait4hw    | Wait for the end of the EdkDSP floating point vector operation (blocking)          |
| pb2lcd_ascii_char | Write to local 2x16 LCD display                                                    |
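The conversion performed by hex\_h and hex\_l (one 4-bit half of a byte to one ASCII hexadecimal digit) can be sketched in plain C. This is not the optimised PicoBlaze6 assembler from the I/O library. The function names are hypothetical, the handshake with MicroBlaze is left out, and the use of uppercase digits A to F is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a 4-bit value (0..15) to its ASCII hexadecimal digit.
 * Uppercase letters are an assumption of this sketch. */
char nibble_to_ascii(uint8_t nibble)
{
    assert(nibble < 16);
    return (char)(nibble < 10 ? '0' + nibble : 'A' + (nibble - 10));
}

/* ASCII digit for the high 4 bits of the 8-bit argument (role of hex_h). */
char hex_high_ascii(uint8_t value)
{
    return nibble_to_ascii((uint8_t)(value >> 4));
}

/* ASCII digit for the low 4 bits of the 8-bit argument (role of hex_l). */
char hex_low_ascii(uint8_t value)
{
    return nibble_to_ascii((uint8_t)(value & 0x0F));
}
```

Calling the two helpers in sequence on the value 0x3C yields the characters '3' and 'C', which is how a firmware program would print one byte as two hexadecimal digits.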

### 2.5 EdkDSP C compiler

This section briefly describes how to use the UTIA EdkDSP C compiler. It cross-compiles (on a PC) simple C programs for the PicoBlaze6 controller.

The evaluation package also includes precompiled firmware files for the PicoBlaze6 controller. These files can be used for first evaluations of the EdkDSP accelerator before the EdkDSP C cross compiler is installed on the user PC.

The UTIA EdkDSP C compiler is part of this evaluation package in the form of Ubuntu binaries. The "VMware Player" software with a compatible Ubuntu image is needed to run the UTIA EdkDSP C compiler on a Windows 7 PC.

The Ubuntu image needs two DVDs (8 GB) for installation, which is why it is not included in the evaluation package. If you need this image, send an email request to kadlec@utia.cas.cz to get the two DVDs with the correct Ubuntu image from UTIA (free of charge).

Install VMware Workstation 12 Player [10] on a Windows 7 64-bit PC.







Figure 36: Select the Ubuntu EdkDSP image in the VMware Player and click "Play".

Open the VMware Workstation 12 Player and select the "Ubuntu\_EdkDSP" image. Ubuntu will start. Log in as:

User: **devel** Pswd: **devuser** 

The PC directory C:\VM\_07 needs to be shared by Windows 7 with the Ubuntu OS. In Windows 7, set the directory C:\VM\_07 and its subdirectories as shared with the \_\_vmware\_user\_\_ for Read and Write.

In Ubuntu, open a terminal and mount the PC directory **C:\VM\_07** to Ubuntu by typing **cd bin** and then **samba\_07.sh**

The Windows 7 C:\VM\_07 directory is mounted to the Ubuntu OS as: /mnt/cdrive

In Ubuntu terminal, change the directory to:

#### /mnt/cdrive/s30i1hm4/edkdsp

The EdkDSP C compiler utilities have to be on the Ubuntu PATH. This is done by sourcing the **settings.sh** script in this directory.



Type in Ubuntu terminal (See Figure 37):

#### source settings.sh

In the Ubuntu terminal, change to the example directory by typing **cd a**. The prompt then shows:

#### /mnt/cdrive/s30i1hm4/edkdsp/a\$

Provided C source code examples can be compiled by script **ca\_fp11.sh** with parameter **a**. Type in the Ubuntu terminal:

#### ca\_fp11.sh a

This will compile and assemble four C firmware programs to header files with the firmware binary code for the EdkDSP accelerator:

```
a_fp1101p0.c is compiled to fill_FA1101P0_program_store.h
a_fp1101p1.c is compiled to fill_FA1101P1_program_store.h
a_fp1124p0.c is compiled to fill_FA1124P0_program_store.h
a_fp1124p1.c is compiled to fill_FA1124P1_program_store.h
```

Figure 38 presents C source code for the firmware for computation of the LMS in the (8xSIMD) EdkDSP platform. The FIR filter source code is presented in Figure 39.

To use the compiled headers in the SDK project, copy and paste

```
edkdsp/a/fill_FA1101P0_program_store.h
edkdsp/a/fill_FA1101P1_program_store.h
edkdsp/a/fill_FA1124P0_program_store.h
edkdsp/a/fill_FA1124P1_program_store.h
```

to the SDK project directory (in case of sh03\_edkdsp\_fp12\_1x8\_all) to:

#### C:\VM\_07\s30i1hm4\sh03\_edkdsp\_fp12\_1x8\_all\src

Recompile the MicroBlaze project "sh03\_edkdsp\_fp12\_1x8\_all". The compiled firmware for the (8xSIMD) EdkDSP will be used by the MicroBlaze C code of the demo as data for the runtime (re)configurations of the (8xSIMD) EdkDSP accelerator PicoBlaze6 controller.

The run-time change of firmware is demonstrated in all included demos by swapping the firmware for the computation of FIR and LMS filters in the EdkDSP accelerator.

The evaluation design used in this application note works with three instances of the (8xSIMD) EdkDSP floating point accelerator IP cores **bce\_fp12\_1x8\_40**.




```
devel@ubuntu:~$ cd bin
devel@ubuntu:~/bin$ samba_07.sh
[sudo] password for devel:
Password:
devel@ubuntu:~/bin$ cd /mnt/cdrive/t30i1hm4/edkdsp
devel@ubuntu:/mnt/cdrive/t30i1hm4/edkdsp$ ls
a  include  lib  settings.sh  tools
devel@ubuntu:/mnt/cdrive/t30i1hm4/edkdsp$ source settings.sh
EdkDSP environment set to '/mnt/cdrive/t30i1hm4/edkdsp'
devel@ubuntu:/mnt/cdrive/t30i1hm4/edkdsp$ cd a
devel@ubuntu:/mnt/cdrive/t30i1hm4/edkdsp/a$ ls
a_fp1101p0.c  ca.sh         fill_FA1101P0_program_store.h
a_fp1101p0.h  FA1101P0.log  fill_FA1101P0_program_store.m
a_fp1101p1.c  FA1101P0.PSM  fill_FA1101P1_program_store.h
a_fp1101p1.h  FA1101P1.log  fill_FA1101P1_program_store.m
a_fp1124p0.c  FA1101P1.PSM  fill_FA1124P0_program_store.h
a_fp1124p0.h  FA1124P0.log  fill_FA1124P0_program_store.m
a_fp1124p1.c  FA1124P0.PSM  fill_FA1124P1_program_store.h
a_fp1124p1.h  FA1124P1.log  fill_FA1124P1_program_store.m
ca_fp11.sh    FA1124P1.PSM  stdio_fp11.h
devel@ubuntu:/mnt/cdrive/t30i1hm4/edkdsp/a$ ca_fp11.sh a
EDKDSPCC : a_fp1101p0.c ...
EDKDSPASM: FA1101P0.PSM ...
Generated M function file in the M file ././fill_FA1101P0_program_store.m
Generated C header file in the H file ./fill_FA1101P0_program_store.h
EDKDSPCC : a_fp1101p1.c ...
EDKDSPASM: FA1101P1.PSM ...
Generated M function file in the M file ././fill_FA1101P1_program_store.m
Generated C header file in the H file ./fill_FA1101P1_program_store.h
EDKDSPCC : a_fp1124p0.c ...
EDKDSPASM: FA1124P0.PSM ...
Generated M function file in the M file ././fill_FA1124P0_program_store.m
Generated C header file in the H file ./fill_FA1124P0_program_store.h
EDKDSPCC : a_fp1124p1.c ...
EDKDSPASM: FA1124P1.PSM ...
Generated M function file in the M file ././fill_FA1124P1_program_store.m
Generated C header file in the H file ./fill_FA1124P1_program_store.h
devel@ubuntu:/mnt/cdrive/t30i1hm4/edkdsp/a$ ls
a_fp1101p0.c  ca.sh         fill_FA1101P0_program_store.h
a_fp1101p0.h  FA1101P0.log  fill_FA1101P0_program_store.m
a_fp1101p1.c  FA1101P0.PSM  fill_FA1101P1_program_store.h
a_fp1101p1.h  FA1101P1.log  fill_FA1101P1_program_store.m
a_fp1124p0.c  FA1101P1.PSM  fill_FA1124P0_program_store.h
a_fp1124p0.h  FA1124P0.log  fill_FA1124P0_program_store.m
a_fp1124p1.c  FA1124P0.PSM  fill_FA1124P1_program_store.h
a_fp1124p1.h  FA1124P1.log  fill_FA1124P1_program_store.m
ca_fp11.sh    FA1124P1.PSM  stdio_fp11.h
devel@ubuntu:/mnt/cdrive/t30i1hm4/edkdsp/a$
```



Figure 37: Compilation of EdkDSP firmware in Ubuntu.



Figure 38: C listing of the LMS filter firmware for the EdkDSP.

#### Note:

UTIA maintains four grades [10|20|30|40] of the (8xSIMD) EdkDSP accelerator IP. Cores differ in HW-supported vector floating point computing capabilities:

- **bce\_fp12\_1x8\_10** is area optimized and supports local vector data transfers (HW-supported 8xSIMD transfers inside the accelerator IP) and the vector floating point operations FPADD and FPSUB in 8xSIMD data paths.
- **bce\_fp12\_1x8\_20** performs the same operations as bce\_fp12\_1x8\_10 plus vector floating point MAC operations in 8xSIMD data paths. MAC is supported for vector lengths from 1 up to 10. This accelerator is optimized for applications like floating point matrix multiplication with one row and column dimensions <=
- **bce\_fp12\_1x8\_30** supports the same operations as bce\_fp12\_1x8\_20 plus HW accelerated computation of floating point vector by vector dot products performed in 8xSIMD data paths. It is optimized for parallel computation of up to 8 FIR or LMS filters, each with a size of up to 250 coefficients. It is also efficient for floating point matrix by matrix multiplications where one of the dimensions is large (in the range from 11 to 250).
- **bce\_fp12\_1x8\_40** supports the same operations as bce\_fp12\_1x8\_30 plus additional HW support for the dot product. It is computed in 8xSIMD data paths with a HW-supported pipeline wind-up into a single scalar result. This result is propagated into all 8 SIMD data planes.

All **bce\_fp12\_1x8\_[10|20|30|40]** accelerator IP cores support a single data path for pipelined floating-point division (FPDIV), with vector operands taken from the first SIMD plane and the result vector propagated into all 8 SIMD data planes.

All **bce\_fp12\_1x8\_[10|20|30|40]** accelerator IP cores are suitable for applications like adaptive normalised LMS (NLMS) filters, square-root-free versions of adaptive RLS QR filters and adaptive RLS lattice filters.
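The difference between the per-plane dot products of the \_30 grade and the wound-up scalar result of the \_40 grade can be illustrated with a plain C software model. This is a sketch under assumptions, not the accelerator firmware: the function names and the plane-major memory layout are hypothetical, and the wind-up is modelled simply as the sum of the eight per-plane results copied into every plane.

```c
#include <assert.h>
#include <stddef.h>

#define SIMD_PLANES 8

/* Model of eight independent dot products, one per SIMD plane.
 * Plane m uses the n values starting at a[m * n] and b[m * n]
 * (plane-major layout, an assumption of this sketch). */
void dot_per_plane(const float *a, const float *b, size_t n,
                   float out[SIMD_PLANES])
{
    assert(n > 0);
    for (size_t m = 0; m < SIMD_PLANES; m++) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++) {
            acc += a[m * n + i] * b[m * n + i];
        }
        out[m] = acc;
    }
}

/* Model of the dot product with wind-up: the eight per-plane partial
 * results are reduced to one scalar, which is then written to all planes. */
void dot_with_windup(const float *a, const float *b, size_t n,
                     float out[SIMD_PLANES])
{
    float partial[SIMD_PLANES];
    float total = 0.0f;

    dot_per_plane(a, b, n, partial);
    for (size_t m = 0; m < SIMD_PLANES; m++) {
        total += partial[m];
    }
    for (size_t m = 0; m < SIMD_PLANES; m++) {
        out[m] = total;   /* same scalar in every SIMD plane */
    }
}
```

The per-plane form matches the case of up to 8 parallel FIR or LMS filters, while the wound-up form corresponds to a single long dot product whose terms are spread over the planes.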





Figure 39: C listing of the FIR filter firmware for the EdkDSP.

### 2.6 Debug of EdkDSP accelerator firmware with In-circuit Logic Analyser (ILA)

This application note includes an evaluation version of the design with an ARM Cortex A9 processor and a MicroBlaze processor controlling three (8xSIMD) EdkDSP accelerators. The first of these accelerators is configured with the Xilinx In-circuit Logic Analyser (ILA) for debug in the Vivado 2015.4 Lab Edition tool. The tool can be downloaded for free from the Xilinx support webpage [11]. The platform with three (8xSIMD) EdkDSP accelerators and ILA support is present in the directory:

C:\VM\_07\s30i1hm4\sh03\_hw\_platform\_0\_ila




You can repeat all evaluation and compilation steps described in this application note for the demos without ILA, but use the bitstream from the \*\_hw\_platform\_0\_ila directory.

Example for the sh03 demo:

In SDK, select Xilinx Tools -> Program FPGA and choose "sh03\_hw\_platform\_0\_ila" (instead of "sh03\_hw\_platform\_0"). Click on the "Program" button.



Figure 40: Change the default hw\_platform\_0 to the hw\_platform\_0\_ila.

The implemented In-Circuit Logic Analyser (ILA) stores 32k samples of every output of the (8xSIMD) EdkDSP accelerator debug ports.

The debug ports provide basic visibility into the vector (8xSIMD) EdkDSP accelerator. The prepared ILA debug environment provides synchronised time records of the addresses and the schedule of executed floating point vector operations.

Processed floating point data are not stored. These data are better analysed in the MicroBlaze debugger. MicroBlaze and its debugger can access all dual-ported memories of the (8xSIMD) EdkDSP accelerator at synchronisation points defined by the programmer.





Figure 41: Debug ports of the (8xSIMD) EdkDSP floating point accelerator IP core.

## Debug ports of the (8xSIMD) EdkDSP accelerator

All debug ports are sampled at 150 MHz and stored with a depth of 32k samples (see Figure 41):

- **bce\_atoa[0:9]** Memory A address (addressing 1024 32 bit floating point values)
- **bce\_atob[0:9]** Memory B address (addressing 1024 32 bit floating point values)
- **bce\_atoz[0:9]** Memory Z address (addressing 1024 32 bit floating point values)
- **bce\_done[0:7]** Vector operation in progress or finished
- **bce\_led4b[0:3]** 4 bit output, intended for LED signalling. (Unconnected to external pins).
- **bce\_mode[0:3]** Mode of the communication protocol PicoBlaze6 - MicroBlaze
- **bce\_op[0:7]** Vector operation to be performed.
- **bce\_port[0:7]** 8 bit output port. (Unconnected to external pins).
- **bce\_port\_i** External port address. Address space [0x0 0x1F] is reserved for internal construction of the VLIW instruction to the 8xSIMD vector processing unit of the EdkDSP. Address space [0x20 0xFF] can be used by the
- **bce\_port\_wr** 1 bit output. Write strobe for write of 8 bit data to the external port address.
- **bce\_r\_pb** 1 bit output. Reset of the PicoBlaze6.
- **bce\_we** Write strobe signals start of execution of a VLIW instruction by the vector processing unit of the EdkDSP.

These debug ports are used for real-time visualisation, debug and analysis of the computation implemented inside the 8xSIMD vector processing unit of the (8xSIMD) EdkDSP accelerator IP. This makes it easier to debug the compiled PicoBlaze6 firmware code.
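A few numbers follow directly from the port description and are handy when reading a captured trace. The helpers below are a sketch with hypothetical names. "32k" is taken as 32 x 1024 samples, which is an assumption, and the reserved port range 0x00 to 0x1F is taken from the port list above.

```c
#include <assert.h>
#include <stdint.h>

#define ILA_DEPTH_SAMPLES 32768u      /* "32k" read as 32 * 1024 (assumption) */
#define ILA_SAMPLE_HZ     150000000u  /* 150 MHz sample frequency             */

/* Duration of one captured trace in nanoseconds (rounded down). */
uint32_t ila_window_ns(void)
{
    return (uint32_t)(((uint64_t)ILA_DEPTH_SAMPLES * 1000000000u)
                      / ILA_SAMPLE_HZ);
}

/* Number of values reachable with an address bus of the given width,
 * e.g. the 10-bit memory A, B and Z address ports. */
uint32_t addressable_words(unsigned address_bits)
{
    assert(address_bits < 32);
    return 1u << address_bits;
}

/* Returns 1 for external port addresses reserved for the internal
 * construction of the VLIW instruction (0x00..0x1F), 0 otherwise. */
int port_is_reserved(uint8_t port_addr)
{
    return port_addr <= 0x1F;
}
```

Under these assumptions one full capture covers roughly 218 microseconds of accelerator activity, and a 10-bit address port reaches all 1024 floating point values of a memory.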




All vector operations of the (8xSIMD) EdkDSP **bce\_fp12\_1x8\_40** accelerator can be monitored at the **bce\_op[0:7]** debug port. These 8xSIMD vector operations are defined in Table 6.

Table 6: (8xSIMD) EdkDSP bce\_fp12\_1x8\_40 accelerator vector operations

| Name in MicroBlaze C | Value (dec) | 8xSIMD floating point operation |
|----------------------|-------------|---------------------------------|
| WAL_BCE_JK_VVER | = 0 | Return capabilities of the (8xSIMD) EdkDSP accelerator |
| WAL_BCE_JK_VZ2A                       | = 1         | 8xSIMD copy am[i] <= zm[j]; m=18                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                       |
| WAL_BCE_JK_VB2A                       | = 2         | 8xSIMD copy am[i] <= bm[j]; m=18                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                       |
| WAL_BCE_JK_VZ2B                       | = 3         | 8xSIMD copy bm[i] <= zm[j]; m=18                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                       |
| WAL_BCE_JK_VA2B                       | = 4         | 8xSIMD copy bm[i] <= am[j]; m=18                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                                                                       |
|                                       |             |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                       |
| WAL_BCE_JK_VADD                       | = 5         | 8xSIMD add zm[i] <= am[j] + bm[k]; m=18 |
| WAL_BCE_JK_VADD_BZ2A                  | = 6         | 8xSIMD add am[i] <= bm[j] + zm[k]; m=18 |
| WAL_BCE_JK_VADD_AZ2B                  | = 7         | 8xSIMD add bm[i] <= am[j] + zm[k]; m=18 |
|                                       |             |  |
| WAL_BCE_JK_VSUB                       | = 8         | 8xSIMD sub zm[i] <= am[j] - bm[k]; m=18 |
| WAL_BCE_JK_VSUB_BZ2A                  | = 9         | 8xSIMD sub am[i] <= bm[j] - zm[k]; m=18 |
| WAL_BCE_JK_VSUB_AZ2B                  | = 10        | 8xSIMD sub bm[i] <= am[j] - zm[k]; m=18 |
|                                       |             |  |
| WAL_BCE_JK_VMULT                      | = 11        | 8xSIMD mult zm[i] <= am[j] * bm[k]; m=18 |
| WAL_BCE_JK_VMULT_BZ2A                 | = 12        | 8xSIMD mult am[i] <= bm[j] * zm[k]; m=18 |
| WAL_BCE_JK_VMULT_AZ2B                 | = 13        | 8xSIMD mult bm[i] <= am[j] * zm[k]; m=18 |
| WAL_BCE_JK_VPROD                      | = 14        | 8xSIMD vector products:<br>zm[i] <= am'[jj+nn]*bm[kk+nn]; m=1..8; nn range 1..255 |
| WAL_BCE_JK_VMAC                       | = 15        | 8xSIMD vector MACs:<br>zm[ii+nn] <= zm[ii+nn] + am[jj+nn] * bm[kk+nn];<br>nn range 1..13 |
| WAL_BCE_JK_VMSUBAC                    | = 16        | 8xSIMD vector MSUBACs:<br>zm[ii+nn] <= zm[ii+nn] - am[jj+nn] * bm[kk+nn];<br>nn range 1..13 |
| WAL_BCE_JK_VPROD_S8                   | = 17        | 8xSIMD vector product (extended):<br>zm[i] <= ( (a1'[jj+nn]*b1[kk+nn]+a2'[jj+nn]*b2[kk+nn])<br>+ (a3'[jj+nn]*b3[kk+nn]+a4'[jj+nn]*b4[kk+nn]) )<br>+<br>( (a5'[jj+nn]*b5[kk+nn]+a6'[jj+nn]*b6[kk+nn])<br>+ (a7'[jj+nn]*b7[kk+nn]+a8'[jj+nn]*b8[kk+nn]) );
                                                                                                       |
|                                       |             | m=18; nn range 1255                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                       |
| WAL_BCE_JK_VDIV                       | = 20        | vector division (extended) zm[i] <= a1[j] / b1[k]; m=18                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                       |



### 2.7 Use of In-circuit Logic Analyser (ILA)

Start the demo design with ILA from the Vivado 2015.4 SDK and start both terminals. The AMP demo design is now running.

It executes the AMP demo SW on the ARM and on the MicroBlaze with three (8xSIMD) EdkDSP accelerators (each with PicoBlaze6 reprogrammable firmware) and with the ILA HW interface configured for debugging of the first EdkDSP accelerator.

Start Vivado Lab Edition 2015.4 and select "Open Hardware Manager". See Figure 42.



Figure 42: Vivado Lab Edition 2015.4

Select "Open Target". See Figure 43. Accept all defaults by clicking the Next button in the following screens.

At this stage, Vivado Lab Edition 2015.4 is connected to the debugged board via JTAG. The names and parameters of the probes (see Figure 41) which can be captured by the ILA configuration in HW and visualised by Vivado Lab Edition 2015.4 are stored in the file **debug\_nets.ltx**.

In Vivado Lab Edition 2015.4, click on the "specify the probes file and refresh the device" link in the Trigger setup hw\_ila\_1 window.


Specify the file:

C:/VM=07/s30i1hm4/sh03\_hw\_platform\_0\_ila/debug\_nets.ltx



See Figure 44. This will add the names and parameters of probes (see Figure 41) to the ILA Waveform window. Use + to select probes used for triggering, and select the condition for trigger for each probe and their combination (use AND as default).



Figure 43: Select Open Target



Figure 44: Select file with definition of probes present in HW.

Some of the debug probes can be used to trigger the capturing of data. The ILA can be triggered from the EdkDSP firmware running on the PicoBlaze6 inside the (8xSIMD) EdkDSP unit.



In SDK, open the <code>edkdsp\_cc/a/a\_fp1124p1.c</code> file. See section of the modified C code <code>FIR</code> firmware. The code includes the additional call to the <code>pb2dfu\_set()</code> function. We will use it for selective triggering of the ILA in this specified point of computation of the EdkDSP accelerator.

File **fill\_FA1124P1\_program\_store.h** contains firmware resulting from compilation of C source code **a\_fp1124p1.c** (see Figure 39).

In Vivado Lab Edition, in the ILA configuration page, change the trigger condition to:

```
(bce_port_wr ==1) AND (bce_port_id[0:7]==0x20) AND (bce_port[0:7]==0x01)
```

In Vivado Lab Edition 2015.4, arm the hw\_ila\_1 core by pressing the **Run Trigger** button in the **Hardware** window.

The armed hw\_ila\_1 core will wait until the recompiled EdkDSP firmware reaches the point where the PicoBlaze6 calls the function pb2dfu\_set(0x20, 1).

The ILA core then captures 32K samples of all debug signals at the sampling rate of 150 MHz. The data are sent via JTAG to Vivado Lab Edition 2015.4 for visualisation and analysis in the waveform window. This snapshot stores the detailed trace of the initial 32K clock cycles of the FIR filter computation as defined by the SW displayed in Figure 39. The red trigger marker corresponds to the trigger event. See Figure 45.

The user can zoom into the data and define additional markers. The selected markers indicate a single step of the FIR filter. It takes 308 clock cycles (150 MHz = 6.667 ns clock period) to compute the vector product of two floating point vectors (coefficients and data), both with the length of 250\*8=2000 elements, and to update the input data vector (in a circular buffer).
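The cycle counts read from the ILA markers can be converted to time with a small helper. The following is a minimal sketch; the function names are hypothetical, and it assumes that "32K samples" means 32768 samples.

```c
#include <assert.h>
#include <stdint.h>

#define EDKDSP_CLK_HZ 150000000.0 /* accelerator and ILA sampling clock (from the text) */
#define ILA_DEPTH     32768u      /* assumption: "32K samples" = 32768 */

/* Duration of a number of clock cycles in nanoseconds. */
double cycles_to_ns(uint32_t cycles)
{
    return (double)cycles * 1.0e9 / EDKDSP_CLK_HZ;
}

/* Number of complete filter steps that fit into one ILA capture. */
uint32_t steps_per_capture(uint32_t cycles_per_step)
{
    assert(cycles_per_step > 0u);
    return ILA_DEPTH / cycles_per_step;
}
```

For the FIR step of 308 clock cycles, `cycles_to_ns(308)` gives approximately 2053 ns, and `steps_per_capture(308)` gives 106 complete FIR steps in one capture under the stated assumption.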

This demonstrates how the Vivado Lab Edition 2015.4 [11] supports visibility and debug capabilities for the developer of the first (8xSIMD) EdkDSP accelerator firmware through the ILA core instantiated in the design.





Figure 45: FIR filter waveforms after the trigger in Vivado Lab Edition 2015.4.

While the system is running, we can modify the trigger condition to capture the initial phase of the LMS filter running on the same EdkDSP accelerator in a next stage of the application code.

In SDK, see edkdsp\_cc/a/a\_fp1124p0.c code implementing the LMS filter on the identical EdkDSP HW. See Figure 38.

The output function **pb2dfu\_set()** writes 0x00 to port 0x20 this time. In Vivado Lab Edition 2015.4, modify the trigger condition in the Trigger window to:

```
(bce_port_wr ==1) AND (bce_port_id[0:7]==0x20) AND (bce_port[0:7]==0x00)
```

In Vivado Lab Edition 2015.4, arm the hw\_ila\_1 core again by pressing the **Run Trigger** button in the **Hardware** window. The armed hw\_ila\_1 core will wait until the running demo design reaches the point where the PicoBlaze6 calls the dedicated function pb2dfu\_set(0x20, 0), as defined in the LMS C code (see Figure 38). This C code is executed by the PicoBlaze6 controller inside the first EdkDSP accelerator.




This will trigger capturing of new 32K samples of all debug signals with the sampling rate 150 MHz and provide detailed trace of the initial 32k samples of the LMS filter computation. See Figure 46.



Figure 46: LMS filter waveforms after the trigger in Vivado Lab Edition 2015.4.

The red trigger marker corresponds to the trigger event. We can zoom into the data and define additional markers. The selected markers indicate a single elementary step of the LMS filter. It takes 1154 clock cycles (150 MHz = 6.667 ns clock period) to compute the vector product of two floating point vectors (coefficients and data), both with the length of 250\*8=2000 elements, update the data vector (in a circular buffer), compute the prediction error and adapt the coefficients of the floating point LMS filter.

The bce\_op[0:7] debug signal is displayed in the analogue/hold mode. This helps to indicate the sequence of vector operations issued by the PicoBlaze6 firmware. Each step of the LMS algorithm is implemented as a sequence of seven vector operations of the (8xSIMD) EdkDSP, as defined in the PicoBlaze6 function lms(). See Figure 38.
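For orientation, one LMS step (vector product, prediction error, coefficient adaptation) can be written in plain C as the following minimal sketch. It is a textbook formulation with hypothetical names. It does not reproduce the seven vector operations of the PicoBlaze6 function lms() or their order, and it leaves the circular buffer update to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* One step of a textbook LMS adaptive FIR filter in single precision.
   w:  filter coefficients (updated in place)
   x:  data vector (delay line), same length as w
   n:  filter length
   d:  desired sample
   mu: adaptation step size
   Returns the prediction error e = d - w.x */
float lms_step(float *w, const float *x, size_t n, float d, float mu)
{
    assert(w != NULL && x != NULL);

    float y = 0.0f;
    for (size_t i = 0; i < n; i++)
        y += w[i] * x[i];       /* filter output: vector product */

    float e = d - y;            /* prediction error */
    float g = mu * e;           /* common gain of the update */

    for (size_t i = 0; i < n; i++)
        w[i] += g * x[i];       /* coefficient adaptation */

    return e;
}
```

Because this loop accumulates the products sequentially, its rounding generally differs from an 8-lane SIMD accumulation; it is suitable as a functional reference only, not as a bit-exact model of the accelerator.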

The ARM code and the MicroBlaze code can be compiled with -O0, ..., -O3 optimisations and executed under both debuggers in combination with the ILA HW debug and visualisation.

The -O0 option provides lower performance on ARM and MicroBlaze processors, but the corresponding binary code includes no transformations. This makes the co-debugging of the ARM and MicroBlaze C code easier.

The MicroBlaze debugger helps also in debugging of the interactions of the MicroBlaze with the (8xSIMD) EdkDSP accelerators. Blocks of floating point data can be inspected and verified with support of the MicroBlaze debugger before and after the synchronisation points of the MicroBlaze API interface.




The (8xSIMD) EdkDSP accelerator code and the floating point data path are deterministic. All operations can be also emulated in the MicroBlaze C code, including the exact sequence of all floating point operations.

The floating point HW unit of the MicroBlaze supports the single precision floating point ADD and MULT operations with bit-exact identical results to the floating point units used in the (8xSIMD) EdkDSP accelerators.

This determinism ensures that the MicroBlaze "golden C code" can deliver floating point results which are bit-exact identical to those of the (8xSIMD) EdkDSP accelerators. This is used for verification of algorithms executed by the (8xSIMD) EdkDSP accelerator.
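A bit-exact check of this kind compares the 32-bit patterns of the results rather than their numerical values, because the C operator `==` treats +0.0 and -0.0 as equal and never matches NaN values. The following is a minimal sketch with hypothetical function names; it is not taken from the evaluation package.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

/* Return the 32-bit pattern of a single precision value. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* well-defined alternative to pointer casts */
    return u;
}

/* Compare the golden C results with the accelerator results bit by bit.
   Returns the index of the first differing element, or n if all match. */
size_t first_bit_mismatch(const float *golden, const float *hw, size_t n)
{
    assert(golden != NULL && hw != NULL);
    for (size_t i = 0; i < n; i++) {
        if (float_bits(golden[i]) != float_bits(hw[i]))
            return i;
    }
    return n;
}
```

A return value equal to `n` means that the two blocks of data are bit-exact identical. Any smaller value identifies the first element to inspect in the debugger.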



Figure 47: Separate dashboard with display of temperature and voltage in Vivado Lab Edition 2015.4.

The JTAG based interface of Vivado Lab Edition 2015.4 also supports the continuous download of some additional signals measured in the ZYNQ fabric. The data can be opened in a separate dashboard.

The dashboard is presented in Figure 47. It displays the temperature inside the ZYNQ fabric, which oscillates around 62 degrees Celsius. The sampling period is 0.5 s.

These measurements run in the background and do not influence the HW and SW running on the monitored device.

Figure 48 shows the running video processing system performing the HW accelerated full HD edge detection together with the floating point computation of the FIR and LMS filters and the ILA debug facility.





Figure 48: Accelerated video processing algorithm and ILA debug of the accelerated LMS filter.



## 3. Conclusions

This application note and the related evaluation package document the following general observations and conclusions:

- The programmable logic part of the Zynq XC7Z030-1I device is capable of implementing the UTIA (8xSIMD) EdkDSP floating point accelerator together with the HW accelerated Full HD HDMII-HDMIO video processing chain with the fixed resolution 1920x1080p60.
- The total power consumption for the HW accelerated video processing is up to 8.2 W. This requires at least a passive heat sink.
- The video processing is significantly faster due to the HW accelerators. Acceleration from 6.7x to 35x has been reached in comparison to the 667 MHz ARM Cortex A9 SW solution. This acceleration of video processing in HW has been reached while the EdkDSP floating point accelerator computations are performed at the same time in the PL fabric of the Zynq device.
- Designs are provided with and without the ILA debug support.
- The combination of 32 bit MicroBlaze with the three (8xSIMD) EdkDSP floating point accelerators brings additional capability to accelerate computation in single precision floating point with performance in the range of 0.9 GFLOP/s to 1.4 GFLOP/s (1.419 GFLOP/s in case of FIR filter) in one of the three (8xSIMD) EdkDSP accelerators. This floating point performance comes at the expense of relatively moderate increase of the total power consumption of the system:
  - 6.98 W: ARM, md01 HW, MicroBlaze (+14 MFLOP/s) ......(+6.98 W)
  - 7.63 W: ARM, md01 HW, MicroBlaze, 3x EdkDSP, 3x no use (+14 MFLOP/s) ......(+0.65 W)
  - 7.84 W: ARM, md01 HW, MicroBlaze, 3x EdkDSP, 2x no use, 1x use +1.8 GFLOP/s (+0.21 W)
  - Power per one GFLOP/s for one (8xSIMD) EdkDSP accelerator (static+dynamic power, no ILA):
    - One EdkDSP HW accelerator (150 MHz) computing LMS Filter: 324 mW/GFLOP/s
    - One EdkDSP HW accelerator (150 MHz) computing FIR Filter: 226 mW/GFLOP/s
- MicroBlaze soft core with 3x (8xSIMD) EdkDSP accelerator takes significant part of Zynq PL resources. This limits the maximal number of parallel HW video processing chains and therefore limits the achievable reduction of the energy per pixel.

This application note documents how designs debugged and developed in the high level SDSoC 2015.4 environment can be exported to the end-user in form of SDK 2015.4 SW projects with precompiled HW designs.

The enclosed SDK 2015.4 projects give the user the opportunity to make top-level SW adaptations and customisations of the final application in the C source code. The run-time re-programming of the 8xSIMD EdkDSP accelerator in C and ASM is also supported. This user customisation is possible with the SDK 2015.4 toolchain provided by Xilinx for free. The user needs neither access to the SDSoC 2015.4 board support package nor the SDSoC 2015.4 license. This application note briefly explained the integration of the Full HD HDMII-HDMIO video processing chain with the HW accelerated video processing algorithms and three run-time reprogrammable 8xSIMD EdkDSP floating point accelerators on the commercially available modular hardware [1] – [8].





## 4. References

| Ref. | Description |
|------|-------------|
| [1]  | TE0715-03-30-1I Xilinx Zynq Z-7030 SoC Micromodule XC7Z030-1SBG485I (ind. temp. range -40°C to +85°C) |
|      | https://shop.trenz-electronic.de/en/Products/Trenz-Electronic/TE07XX-Zynq-SoC/ |
| [2]  | TE0715-04-30-1I SoC Micromodule with Xilinx Zynq XC7Z030-1SBG485I (ind. temp. range -40°C to +85°C) |
|      | https://shop.trenz-electronic.de/en/Products/Trenz-Electronic/TE07XX-Zynq-SoC/ |
| [3]  | TE0715-03-30-1C SoC Micromodule with Xilinx Zynq XC7Z030-1SBG485C (commercial temp. range 0°C to +70°C) |
|      | https://shop.trenz-electronic.de/en/Products/Trenz-Electronic/TE07XX-Zynq-SoC/ |
| [4]  | TE0715-04-30-1C SoC Micromodule with Xilinx Zynq XC7Z030-1SBG485C (commercial temp. range 0°C to +70°C) |
|      | https://shop.trenz-electronic.de/en/Products/Trenz-Electronic/TE07XX-Zynq-SoC/ |
| [5]  | Heatsink for TE0715, spring-loaded embedded |
|      | https://shop.trenz-electronic.de/en/26922-Heatsink-for-TE0720-spring-loaded-embedded?c=38 |
| [6]  | EMC<sup>2</sup>-DP PC/104 OneBank Carrier for SoC Modules |
|      | http://www.sundance.technology/som-cariers/pc104-boards/emc2-dp/ |
| [7]  | https://www.xilinx.com/products/boards-and-kits/1-7gkvgm.html |
| [8]  | FMC Full HD HDMII-HDMIO extension board AES-FMC-HDMI-CAM-G |
|      | http://products.avnet.com/shop/en/ema/3074457345623664802 |
| [9]  | PMODRS232: Serial converter & interface. |
|      | https://shop.trenz-electronic.de/de/23331-PMODRS232-Serial-converter-und-interface?c=215 |
| [10] | VMware Workstation Player Documentation |
|      | https://www.vmware.com/support/pubs/player_pubs.html |
| [11] | Vivado HLx Web Install Client - 2015.4. |
|      | http://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools/2015-4.html |
| [12] | SDSoC - 2015.4 Full Product Installation. |
|      | http://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/sdx-development-environments/sdsoc/2015-4.html |
| [13] | EMC<sup>2</sup> – 'Embedded Multi-Core systems for Mixed Criticality applications in dynamic and changeable real-time environments' is an ARTEMIS Joint Undertaking project in the Innovation Pilot Programme 'Computing platforms for embedded systems' (AIPP5). |
|      | http://www.emc2-project.eu/ |
| [14] | EMC<sup>2</sup> UTIA www page |
|      | http://sp.utia.cz/index.php?ids=projects/emc2 |



## 5. The evaluation version of the package can be downloaded from UTIA www pages [14] free of charge.

#### **Deliverables:**

The evaluation package includes evaluation bitstreams with three (8xSIMD) EdkDSP accelerators working in parallel with the HW-accelerated edge detection and motion detection algorithms for the Full HD HDMII-HDMIO video processing on the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) located on the Sundance EMC2-DP-V2 carrier [6] with the FMC card [8].

The evaluation package [14] includes bitstreams compiled with the evaluation version of the UTIA (8xSIMD) EdkDSP HW accelerator IP core. Evaluation IPs compiled in the enclosed bitstreams:

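| bce\_fp12\_1x8\_0\_axiw\_v1\_10\_c | Evaluation version of the AXI-lite interface           |
|------------------------------------|--------------------------------------------------------|
| bce\_fp12\_1x8\_40                 | Evaluation version of the floating point data path     |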

This evaluation version of the UTIA (8xSIMD) EdkDSP accelerator is compiled into the bitstream with a HW limit on the number of vector operations.

The termination of the nonexclusive, non-transferable evaluation license of this evaluation IP core is reported in advance by the demonstrator on the RS232 terminal. The evaluation designs run again after the reset.

The evaluation package [14] includes SDK 2015.4 SW projects with source code for MicroBlaze processor and ARM processor. SW projects support the family of UTIA (8xSIMD) EdkDSP accelerators for the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) on Sundance EMC2-DP-V2 carrier board [6].

The evaluation package [14] includes SDK 2015.4 SW projects with C source code for ARM Cortex A9 processor (32bit) in standalone mode, C source code for MicroBlaze and C source code for the EdkDSP PicoBlaze6 controller.

The evaluation package [14] includes these static libraries for ARM Cortex A9 processor (32bit) for standalone mode:

| libfmc_imageon.a | SDK 2015.4 UTIA static library with interface functions for video IP cores |
|------------------|----------------------------------------------------------------------------|
| libwal.a         | SDK 2015.4 UTIA static library with EdkDSP API for MicroBlaze              |
| libsh01.a        | SDSoC 2015.4 static library for HW accelerator in project sh01             |
| libsh02.a        | SDSoC 2015.4 static library for HW accelerator in project sh02             |
| libsh03.a        | SDSoC 2015.4 static library for HW accelerator in project sh03             |
| libmd01.a        | SDSoC 2015.4 static library for HW accelerator in project md01             |

These libraries have no time restriction. Source code of these libraries is not provided in this evaluation package.

The evaluation package [14] includes these binary applications for Ubuntu:

| edkdsppp  | EdkDSP C pre-processor binary for Ubuntu in VMware Workstation 12 Player. |
|-----------|---------------------------------------------------------------------------|
| edkdspcc  | EdkDSP C compiler binary for Ubuntu in VMware Workstation 12 Player.      |
| edkdspasm | EdkDSP ASM compiler binary for Ubuntu in VMware Workstation 12 Player.    |





These binary applications have no time restriction. The user of the evaluation package has nonexclusive, nontransferable license from UTIA to use these utilities for compilation of the firmware for the Xilinx PicoBlaze6 processor inside of the UTIA EdkDSP accelerators in precompiled designs. The source code of these compilers is owned by UTIA and it is not provided in the evaluation package.

The evaluation package [14] includes demonstration firmware in C source code for the Xilinx PicoBlaze6 processor for the family of UTIA EdkDSP accelerators for the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) on the Sundance EMC2-DP-V2 carrier board [6].

The evaluation package also includes compiled versions of this firmware in the form of header files (.h). These compiled firmware files can be used for the initial test of the UTIA EdkDSP accelerators on the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) on the Sundance EMC2-DP-V2 carrier board [6] without the need to install the UTIA compiler binaries and the Ubuntu image under the VMware Workstation 12 Player [10].

On email request to kadlec@utia.cas.cz , UTIA will send DVD with the Ubuntu image with pre-installed compiler binary files free of charge. The image can be played in the VMware Workstation 12 Player [10].

HW boards are not part of deliverables. HW can be ordered separately from [1] - [8].

Any and all legal disputes that may arise from or in connection with the use, intended use of or license for the software provided hereunder shall be exclusively resolved under the regional jurisdiction relevant for UTIA AV CR, v. v. i. and shall be governed by the law of the Czech Republic. See also the Disclaimer section.




## 6. Vivado projects with the evaluation version of the (8xSIMD) EdkDSP IP for the Artemis EMC2 project partners.

The **Vivado 2015.4 projects** for the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) located on the Sundance EMC2-DP-V2 carrier [6] with the FMC card [8] **with the evaluation version of the (8xSIMD) EdkDSP accelerator IP for the partners in the Artemis EMC2 project [13]** can be ordered from UTIA AV CR, v.v.i., by email request for quotation to <u>kadlec@utia.cas.cz</u>.

UTIA AV CR, v.v.i., will provide the quotation to the EMC2 project partner by email. After confirmation of the quotation by the customer, UTIA AV CR, v.v.i., will send the customer this invoice:

The Vivado 2015.4 projects for the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) located on the Sundance EMC2-DP-V2 carrier [6] with the FMC card [8] with the evaluation version of the (8xSIMD) EdkDSP accelerator IP for the partners in the Artemis EMC2 project (Without VAT)

After receiving confirmation from the EMC2 project partner about the zero-invoice received, UTIA AV CR, v.v.i. will send within 5 working days by standard mail printed version of this application note together with DVD with the Deliverables described in this section.

#### **Deliverables:**

The evaluation package for EMC2 partners [8] includes the Vivado 2015.4 design projects which can be modified and recompiled by the EMC2 partner. The evaluation version of the UTIA (8xSIMD) EdkDSP accelerator is provided as part of the Xilinx Vivado 2015.4 design projects. Evaluation IPs included:

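| bce\_fp12\_1x8\_0\_axiw\_v1\_10\_c | Netlist of the evaluation version of the AXI-lite interface         |
|------------------------------------|---------------------------------------------------------------------|
| bce\_fp12\_1x8\_40                 | Netlist of the evaluation version of the floating point data path   |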

This netlist evaluation version of the UTIA (8xSIMD) EdkDSP accelerator has a HW limit on the number of vector operations.

EMC2 project [13] partners have nonexclusive, non-transferable license from UTIA to integrate this evaluation netlist into their own Vivado 2015.4 designs and to compile them to unlimited number of bit-streams for the Xilinx ZYNQ xc7z030-1I and xc7z030-1C devices. This nonexclusive, non-transferable license has no time restriction.

The evaluation version of the (8xSIMD) EdkDSP accelerator is an IP core owned by UTIA, and its source code is not provided to the EMC2 partners in the evaluation package. The UTIA (8xSIMD) EdkDSP HW accelerator IP core is compiled with a HW limit on the number of vector operations.

The termination of the nonexclusive, non-transferable evaluation license is reported in advance by the demonstrator on the RS232 terminal. The evaluation designs run again after the reset.

The evaluation package for EMC2 partners includes SDK 2015.4 SW projects with C source code for ARM Cortex A9 processor (32bit) in standalone mode, C source code for MicroBlaze and C source code for the EdkDSP PicoBlaze6 controller.





The evaluation package [14] includes these static libraries for ARM Cortex A9 processor (32bit) for standalone mode:

| libfmc_imageon.a | SDK 2015.4 UTIA static library with interface functions for video IP cores |
|------------------|----------------------------------------------------------------------------|
| libwal.a         | SDK 2015.4 UTIA static library with EdkDSP API for MicroBlaze              |
| libsh01.a        | SDSoC 2015.4 static library for HW accelerator in project sh01             |
| libsh02.a        | SDSoC 2015.4 static library for HW accelerator in project sh02             |
| libsh03.a        | SDSoC 2015.4 static library for HW accelerator in project sh03             |
| libmd01.a        | SDSoC 2015.4 static library for HW accelerator in project md01             |

These libraries have no time restriction. Source code of these libraries is not provided in this evaluation package.

The evaluation package for EMC2 partners includes SDK 2015.4 SW projects with source code for MicroBlaze processor and ARM processor. SW projects support the family of UTIA (8xSIMD) EdkDSP accelerators for the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) on Sundance EMC2-DP-V2 carrier board [6].

The evaluation package for EMC2 partners includes these binary applications for Ubuntu:

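| edkdsppp  | EdkDSP C pre-processor binary for Ubuntu in VMware Workstation 12 Player. |
|-----------|---------------------------------------------------------------------------|
| edkdspcc  | EdkDSP C compiler binary for Ubuntu in VMware Workstation 12 Player.      |
| edkdspasm | EdkDSP ASM compiler binary for Ubuntu in VMware Workstation 12 Player.    |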

These binary applications have no time restriction. The user of the evaluation package has a nonexclusive, non-transferable license from UTIA to use these utilities to compile firmware for the Xilinx PicoBlaze6 processor inside the UTIA EdkDSP accelerators in precompiled designs. The source code of these compilers is owned by UTIA and is not provided in the evaluation package.

The evaluation package for EMC2 partners includes demonstration firmware in C source code for the Xilinx PicoBlaze6 processor for the family of UTIA EdkDSP accelerators for the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) on Sundance EMC2-DP-V2 carrier board [6].

The evaluation package for EMC2 project [13] partners also includes compiled versions of this firmware in the form of .h header files. These compiled firmware files can be used for an initial test of the UTIA EdkDSP accelerators on the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) on the Sundance EMC2-DP-V2 carrier board [6] without the need to install the UTIA compiler binaries and the Ubuntu image under the VMware Workstation 12 Player [10].

On email request to kadlec@utia.cas.cz, UTIA will send a DVD with the Ubuntu image with pre-installed compiler binary files free of charge. The image can be run in the VMware Workstation 12 Player [10].

HW boards are not part of deliverables. HW can be ordered separately from references [1] – [8].

Any and all legal disputes that may arise from or in connection with the use, intended use of or license for the software provided hereunder shall be exclusively resolved under the regional jurisdiction relevant for UTIA AV CR, v. v. i. and shall be governed by the law of the Czech Republic. See also the Disclaimer section.



## 7. Vivado projects with the release version of the (8xSIMD) EdkDSP IP

The release package with **Vivado 2015.4 projects** for the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) located on the Sundance EMC2-DP-V2 carrier [6] with the FMC card [8], **with the release version of the (8xSIMD) EdkDSP accelerator IP with no HW limit on the number of vector operations**, can be ordered by a customer from UTIA AV CR, v.v.i., by sending an email request for quotation to kadlec@utia.cas.cz.

UTIA AV CR, v.v.i., will provide a quotation by email. After the customer confirms the quotation, UTIA AV CR, v.v.i., will send the customer an invoice for the following item:

Vivado 2015.4 projects for the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) located on the Sundance EMC2-DP-V2 carrier [6] with the FMC card [8] with the release version of the (8xSIMD) EdkDSP accelerator IP with no HW limit on number of vector operations.

Price: 400,00 Eur (without VAT)

After receiving payment, UTIA AV CR, v.v.i., will send the customer, within 5 working days and by standard mail, a printed version of the application note together with a DVD containing the deliverables described in this section.

#### **Deliverables:**

The release package includes the Vivado 2015.4 design projects which can be modified and recompiled by the customer. Release IPs included:

| bce_fp12_1x8_0_axiw_v1_10_c | Release netlist of the evaluation version of the AXI-lite interface       |
|-----------------------------|---------------------------------------------------------------------------|
| bce_fp12_1x8_40             | Release netlist of the evaluation version of the floating point data path |
| bce_fp12_1x8_30             | Release netlist of the evaluation version of the floating point data path |
| bce_fp12_1x8_20             | Release netlist of the evaluation version of the floating point data path |
| bce_fp12_1x8_10             | Release netlist of the evaluation version of the floating point data path |

These release netlist versions of the UTIA (8xSIMD) EdkDSP accelerators have **no HW limit on the number of vector operations.**

The customer has a nonexclusive, non-transferable license from UTIA to integrate these netlists into the customer's own Vivado 2015.4 designs and to compile these netlists to an unlimited number of bitstreams for designs for the Xilinx ZYNQ xc7z030-1I and xc7z030-1C devices. This nonexclusive, non-transferable license has no time restriction.

The source code of the (8xSIMD) EdkDSP accelerator IP is owned by UTIA and it is not provided in the release package to the customer.

The release package includes SDK 2015.4 SW projects with C source code for ARM Cortex A9 processor (32bit) in standalone mode, C source code for MicroBlaze and C source code for the EdkDSP PicoBlaze6 controller.




The release package includes these static libraries for ARM Cortex A9 processor (32bit) for standalone mode:

| libfmc_imageon.a | SDK 2015.4 UTIA static library with interface functions for video IP cores |
|------------------|----------------------------------------------------------------------------|
| libwal.a         | SDK 2015.4 UTIA static library with EdkDSP API for MicroBlaze              |
| libsh01.a        | SDSoC 2015.4 static library for HW accelerator in project sh01             |
| libsh02.a        | SDSoC 2015.4 static library for HW accelerator in project sh02             |
| libsh03.a        | SDSoC 2015.4 static library for HW accelerator in project sh03             |
| libmd01.a        | SDSoC 2015.4 static library for HW accelerator in project md01             |

These libraries have no time restriction. Source code of these libraries is not provided in the release package.

The release package includes SDK 2015.4 SW projects with source code for MicroBlaze processor and ARM processor. SW projects support the family of UTIA (8xSIMD) EdkDSP accelerators for the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) on Sundance EMC2-DP-V2 carrier board [6].

The release package includes these binary applications for Ubuntu:

| edkdsppp  | EdkDSP C pre-processor binary for Ubuntu in VMware Workstation 12 Player. |
|-----------|---------------------------------------------------------------------------|
| edkdspcc  | EdkDSP C compiler binary for Ubuntu in VMware Workstation 12 Player.      |
| edkdspasm | EdkDSP ASM compiler binary for Ubuntu in VMware Workstation 12 Player.    |

These binary applications have no time restriction. The user of the release package has a nonexclusive, non-transferable license from UTIA to use these utilities to compile firmware for the Xilinx PicoBlaze6 processor inside the UTIA EdkDSP accelerators in precompiled designs. The source code of these compilers is owned by UTIA and is not provided in the release package.

The release package includes demonstration firmware in C source code for the Xilinx PicoBlaze6 processor for the family of UTIA EdkDSP accelerators for the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) on Sundance EMC2-DP-V2 carrier board [6].

The release package also includes compiled versions of this firmware in the form of .h header files. These compiled firmware files can be used for an initial test of the UTIA EdkDSP accelerators on the Trenz TE0715-03-30-1I module [1] (and [2], [3], [4]) on the Sundance EMC2-DP-V2 carrier board [6] without the need to install the UTIA compiler binaries and the Ubuntu image under the VMware Workstation 12 Player [10].

On email request to kadlec@utia.cas.cz, UTIA will send a DVD with the Ubuntu image with pre-installed compiler binary files free of charge. The image can be run in the VMware Workstation 12 Player [10].

HW boards are not part of deliverables. HW can be ordered separately from references [1] – [8].

Any and all legal disputes that may arise from or in connection with the use, intended use of or license for the software provided hereunder shall be exclusively resolved under the regional jurisdiction relevant for UTIA AV CR, v. v. i. and shall be governed by the law of the Czech Republic. See also the Disclaimer section.





## **Disclaimer**

This disclaimer is not a license and does not grant any rights to the materials distributed herewith. Except as otherwise provided in a valid license issued to you by UTIA AV CR v.v.i., and to the maximum extent permitted by applicable law:

- (1) THIS APPLICATION NOTE AND RELATED MATERIALS LISTED IN THIS PACKAGE CONTENT ARE MADE AVAILABLE "AS IS" AND WITH ALL FAULTS, AND UTIA AV CR V.V.I. HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY PARTICULAR PURPOSE; and
- (2) UTIA AV CR v.v.i. shall not be liable (whether in contract or tort, including negligence, or under any other theory of liability) for any loss or damage of any kind or nature related to, arising under or in connection with these materials, including for any direct, or any indirect, special, incidental, or consequential loss or damage (including loss of data, profits, goodwill, or any type of loss or damage suffered as a result of any action brought by a third party) even if such damage or loss was reasonably foreseeable or UTIA AV CR v.v.i. had been advised of the possibility of the same.

#### **Critical Applications:**

UTIA AV CR v.v.i. products are not designed or intended to be fail-safe, or for use in any application requiring fail-safe performance, such as life-support or safety devices or systems, Class III medical devices, nuclear facilities, applications related to the deployment of airbags, or any other applications that could lead to death, personal injury, or severe property or environmental damage (individually and collectively, "Critical Applications"). Customer assumes the sole risk and liability of any use of UTIA AV CR v.v.i. products in Critical Applications, subject only to applicable laws and regulations governing limitations on product liability.

