



**Application Note** 



# **Benchmarks for STM32H7 MCUs**

Jiří Kadlec, Lukáš Kohout (<u>kadlec@utia.cas.cz</u>, <u>kohoutl@utia.cas.cz</u>)

# **Revision history**

| Rev. | Date      | Description     | Author    |
|------|-----------|-----------------|-----------|
| 0    | 1.12.2019 | Initial release | J. Kadlec |
| 1    | 31.5.2020 | Update          | J. Kadlec |
| 2    |           |                 |           |

WAKeMeUP has been accepted for funding within the Electronic Components and Systems For European Leadership Joint Undertaking in collaboration with the European Union's H2020 Framework Programme (H2020/2014-2020) and National/Local Authorities, under grant agreement n° 783176.





# **Table of Contents**

| Section  | Title                                                                          |
|----------|--------------------------------------------------------------------------------|
| 1        | Introduction to benchmarks for STM32H7                                         |
| 1.1.     | Floating point benchmarks                                                      |
| 1.1.1.   | Matrix Multiplication in Floating Point                                        |
| 1.1.1.1. | Conclusions                                                                    |
| 1.1.2.   | Julia set Fractals in Floating Point                                           |
| 1.1.2.1. | Conclusions                                                                    |
| 1.2.     | Benchmarks of HW accelerated operations                                        |
| 1.2.1.   | DMA data transfers                                                             |
| 1.2.1.1. | Conclusions                                                                    |
| 1.2.2.   | AES GCM Decryption/Encryption                                                  |
| 1.2.2.1. | Conclusions                                                                    |
| 1.2.3.   | CRC                                                                            |
| 1.2.3.1. | Conclusions                                                                    |
| 1.2.4.   | HASH                                                                           |
| 1.2.4.1. | Conclusions                                                                    |
| 1.2.5.   | RANDOM Data                                                                    |
| 1.2.5.1. | Conclusions                                                                    |
| 1.2.6.   | JPEG                                                                           |
| 1.2.6.1. | Conclusions                                                                    |
| 1.2.7.   | MJPEG                                                                          |
| 1.2.7.1. | Conclusions                                                                    |
| 1.2.8.   | SSL Client/Server                                                              |
| 1.2.8.1. | Conclusions                                                                    |
| 1.3.     | Scilab benchmark with NUCLEO-H755ZI as Terminal                                |
| 1.3.1.   | Matrix multiplication in SciLab on ArduZynq in A9 MPU and in FP01x8 Accelerator |
| 1.3.1.1. | Conclusions                                                                    |
|          | Disclaimer                                                                     |



http://sp.utia.cz

Akademie věd České republiky Ústav teorie informace a automatizace AV ČR, v.v.i.

# **Table of Figures**

| Figure    | Caption                                                                               |
|-----------|---------------------------------------------------------------------------------------|
| Figure 1  | Matrix multiplication on STM32756G\_EVAL. SP and DP floating point                    |
| Figure 2  | Matrix multiplication on NUCLEO-H755ZI in SP floating point on CM7 and CM4            |
| Figure 3  | Matrix multiplication of square matrices in single or double precision floating point |
| Figure 4  | Matrix multiplication single or double precision floating point MFLOP/s               |
| Figure 5  | Fractal benchmark - Relative performance                                              |
| Figure 6  | Benchmark demonstrates significant performance increase for the 40 nm devices         |
| Figure 7  | Benchmark results for the HW accelerated AES decryption and encryption                |
| Figure 8  | Benchmark results for HW accelerated CRC                                              |
| Figure 9  | Benchmark results for HW accelerated HASH                                             |
| Figure 10 | Benchmark results for generator of random numbers                                     |
| Figure 11 | Performance of the HW accelerated JPEG decoding and encoding                          |
| Figure 12 | Performance of the HW accelerated MJPEG decoding                                      |
| Figure 13 | SSL Client/Server Ethernet benchmark                                                  |
| Figure 14 | Matrix multiplication on ArduZynq with Scilab and FP01x8 HW accelerator               |
| Figure 15 | NUCLEO-H755ZI terminal with menu control of Scilab algorithms on ArduZynq             |

# Acknowledgement

This work has been partially supported by the project <u>WakeMeUp</u>, project number ECSEL 783176, and by the corresponding Czech NFA (MSMT) institutional support project 8A18001.



# **1 INTRODUCTION TO BENCHMARKS FOR STM32H7**

This application note and evaluation package present benchmarks developed by UTIA to evaluate the maturity of the 40 nm STM32 devices with eFLASH.

The benchmark projects released in the evaluation package for the AC6 System Workbench for STM32 were developed by UTIA as modifications of the STMicroelectronics STM32Cube\_FW\_F7\_V1.15.0 and STM32Cube\_FW\_H7\_V1.5.0 reference SW projects.

The benchmarks document the performance increase of the 40 nm STM32H7 devices with eFLASH in comparison to the 90 nm STM32F7 devices with eFLASH.



Figure 1: Matrix multiplication on STM32756G\_EVAL. SP and DP floating point.



Figure 2: Matrix multiplication on NUCLEO-H755ZI in SP floating point on CM7 and CM4.

- We use the NUCLEO-H755ZI board with two MCUs, the 400 MHz CM7 MCU and the 200 MHz CM4 MCU, on a 40 nm CMOS device with 2 MB eFLASH.
- We also use the NUCLEO-H753ZI board with a single 400 MHz CM7 MCU on a 40 nm CMOS device with 2 MB eFLASH.
- The benchmark projects developed for the STM32756G\_EVAL board serve for comparison with the reference 216 MHz CM7 MCU on a 90 nm CMOS device.




All benchmark projects developed for NUCLEO-H755ZI and NUCLEO-H753ZI support the Adafruit 1.8" colour TFT 128x160 pixel display V2 and, optionally, the older Adafruit TFT display V1.

## 1.1. Floating point benchmarks

This section presents single-precision and double-precision floating point benchmarks.

### 1.1.1. Matrix Multiplication in Floating Point

The benchmark computes the product of square matrices of size [32x32] on the CM4 MCU and of size [48x48] on the CM7 MCUs.

The SW code unrolls the inner loops to make the best use of the HW floating point unit.
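The loop unrolling mentioned above can be sketched as follows. This is a minimal illustration, not the code shipped in the evaluation package: the function names `matmul_unrolled` and `matmul_mflops`, the unrolling factor of four, and the row-major flat array layout are all assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical sketch: c = a * b for square n x n single-precision matrices
 * stored row-major in flat arrays. The inner (k) loop is unrolled by four,
 * so n must be a multiple of 4 (true for both 32 and 48). */
void matmul_unrolled(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k += 4) {
                acc += a[i * n + k]     * b[k * n + j];
                acc += a[i * n + k + 1] * b[(k + 1) * n + j];
                acc += a[i * n + k + 2] * b[(k + 2) * n + j];
                acc += a[i * n + k + 3] * b[(k + 3) * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* MFLOP/s of one n x n product, assuming the common convention of
 * n^3 multiplications plus n^3 additions. */
double matmul_mflops(unsigned n, double seconds)
{
    return 2.0 * n * n * n / (seconds * 1.0e6);
}
```

Under the 2n³ convention assumed here, one [48x48] product counts as 221184 floating point operations. Whether the original benchmark counts operations in exactly this way is not stated in the note.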



Figure 3: Matrix multiplication of square matrices in single or double precision floating point

| MFLOP/s | Precision | STM32F756 CM7 [48x48] | STM32H753 CM7 [48x48] | STM32H755 CM7 [48x48] | STM32H755 CM4 [32x32] |
|---------|-----------|-----------------------|-----------------------|-----------------------|-----------------------|
| FPU ON  | Single    | 145.52                | 302.99                | 287.25                | 30.51                 |
| FPU ON  | Double    | 4.13                  | 87.77                 | 81.32                 | 1.66                  |
| FPU OFF | Single    | 6.04                  | 11.90                 | 12.11                 | 3.86                  |
| FPU OFF | Double    | 4.13                  | 8.05                  | 8.01                  | 1.66                  |

Figure 4: Matrix multiplication single or double precision floating point MFLOP/s

Created SW4STM32 benchmark projects:

- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-fpu-mult\_32d\_CM4
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-fpu-mult\_32f\_CM4
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-fpu-mult\_48d\_CM7
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-fpu-mult\_48f\_CM7
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-fpu-mult\_48d
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-fpu-mult\_48f
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Benchmark-fpu-mult\_48d
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Benchmark-fpu-mult\_48f




#### 1.1.1.1. Conclusions

The increase in single-precision MFLOP/s performance for the 40 nm devices corresponds to the increased system clock:

- 216 MHz for 90 nm device
- 400 MHz for 40 nm devices
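As a rough check, the single-precision FPU ON values reported in Figure 4 can be compared with the clock ratio. The ratios below are worked out here for illustration only and are not additional measurements:

$$\frac{302.99}{145.52} \approx 2.08, \qquad \frac{287.25}{145.52} \approx 1.97, \qquad \frac{400\ \mathrm{MHz}}{216\ \mathrm{MHz}} \approx 1.85$$

The measured single-precision speed-up of the STM32H753 and STM32H755 CM7 over the STM32F756 CM7 is therefore close to, and slightly above, the ratio of the system clocks.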

The significant increase in double-precision MFLOP/s performance for the 40 nm devices is due to the double-precision HW floating point unit implemented for the CM7 MCU in the STM32H7 family of devices.

On the 200 MHz CM4 MCU, only single-precision floating point is HW accelerated.

### 1.1.2. Julia set Fractals in Floating Point

The benchmark computes a simple mathematical fractal: the Julia set. For each point of the complex plane, it evaluates how fast a defined sequence diverges. The Julia set equation for the sequence is: $z(n+1) = z(n)^{2} + c$.
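The per-point iteration can be sketched as follows. This is an illustrative sketch only: the function name `julia_iterations`, the escape radius of 2, and the iteration limit parameter are assumptions and are not taken from the benchmark sources.

```c
/* Hypothetical sketch: counts how many iterations of z(n+1) = z(n)^2 + c
 * are performed before |z| exceeds 2 (assumed escape radius), starting from
 * z(0) = zr + i*zi. Returns max_iter if the sequence stays bounded. */
unsigned julia_iterations(float zr, float zi, float cr, float ci,
                          unsigned max_iter)
{
    unsigned n = 0;
    /* |z| <= 2 is tested as |z|^2 <= 4 to avoid a square root. */
    while (n < max_iter && zr * zr + zi * zi <= 4.0f) {
        float t = zr * zr - zi * zi + cr;  /* real part of z^2 + c */
        zi = 2.0f * zr * zi + ci;          /* imaginary part of z^2 + c */
        zr = t;
        n++;
    }
    return n;
}
```

The returned count is the "divergence speed" of the point: a small count means fast divergence, and a count equal to `max_iter` means the point is treated as belonging to the set. A renderer would typically map this count to a pixel colour.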



Figure 5: Fractal benchmark - Relative performance

Created SW4STM32 benchmark projects:

- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-fpu-frac-CM4
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-fpu-frac-CM7
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-fpu-frac
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Benchmark-fpu-frac

#### 1.1.2.1. Conclusions

The increase in single-precision MFLOP/s performance for the 40 nm devices corresponds to the increased system clock:

- 216 MHz for the 90 nm device
- 400 MHz for the 40 nm devices

The significant increase in double-precision MFLOP/s performance for the 40 nm devices is due to the double-precision HW floating point unit implemented for the CM7 MCU in the STM32H7 family of devices.

## 1.2. Benchmarks of HW accelerated operations

This section presents benchmarks of operations supported by the HW accelerators and peripherals of the compared devices.



### 1.2.1. DMA data transfers

Figure 6: Benchmark demonstrates significant performance increase for the 40 nm devices

Created SW4STM32 benchmark projects:

- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-mdma-CM4
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-mdma-CM7
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-mdma
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Benchmark-dma

#### 1.2.1.1. Conclusions

This benchmark demonstrates a significant performance increase for the 40 nm devices in comparison to the 90 nm devices. On the 40 nm devices, the MDMA HW replaces the DMA HW.

An MDMA data transfer initiated from the lower-power 200 MHz CM4 MCU provides performance comparable to the 400 MHz CM7 device, especially for MDMA transfers from eFLASH to RAM.



### 1.2.2. AES GCM Decryption/Encryption

This benchmark compares the performance of the cryptographic processor. This CRYP peripheral is used to encrypt/decrypt data (Plaintext/Ciphertext) using the AES Galois/counter mode (GCM) and to generate the authentication TAG.

It is a fully compliant implementation of the following standards:

- DES and TDES as defined by FIPS PUB 46-3, 1999 October 25. It follows the American National Standards Institute (ANSI) X9.52 standard.
- AES as defined by FIPS PUB 197, 2001 November 26.

The CRYP processor may be used for both encryption and decryption in the Electronic codebook (ECB) mode, the Cipher block chaining (CBC) mode or the Counter (CTR) mode (in AES only).



Figure 7: Benchmark results for the HW accelerated AES decryption and encryption

Created SW4STM32 benchmark projects:

- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-aes-gcm-dec-CM4
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-aes-gcm-enc-CM4
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-aes-gcm-dec-CM7
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-aes-gcm-enc-CM7
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-aes-gcm-dec
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-aes-gcm-enc
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Benchmark-aes-gcm-dec
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Benchmark-aes-gcm-enc

#### 1.2.2.1. Conclusions

The HW accelerated AES performance of the 40 nm devices is increased in comparison to the 90 nm devices.

HW accelerated AES initiated from the lower-power 200 MHz CM4 MCU provides performance comparable to the 400 MHz CM7 device.



### 1.2.3. CRC

This benchmark compares the performance of the HW supported computation of a 32-bit CRC value over a data buffer of 32-bit words with a size of 64 kBytes.
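For reference, the same kind of computation can be modelled in software. The sketch below assumes the default configuration of the STM32 CRC unit (polynomial 0x04C11DB7, initial value 0xFFFFFFFF, 32-bit input words, no bit reversal, no final XOR); the note does not state which configuration the benchmark uses, and the function name `crc32_words` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a word-wise CRC-32, assuming polynomial
 * 0x04C11DB7, initial value 0xFFFFFFFF, MSB-first processing and no final
 * XOR. Each 32-bit word is folded into the CRC and then shifted out bit by
 * bit. */
uint32_t crc32_words(const uint32_t *data, size_t n_words)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n_words; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 32; bit++) {
            if (crc & 0x80000000u) {
                crc = (crc << 1) ^ 0x04C11DB7u;
            } else {
                crc = crc << 1;
            }
        }
    }
    return crc;
}
```

A 64 kByte buffer corresponds to 16384 such words. The bit-serial loop is written for clarity rather than speed; it is useful mainly for cross-checking results, since the HW unit performs the same update without loading the CPU with the bit loop.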



Figure 8: Benchmark results for HW accelerated CRC

Created SW4STM32 benchmark projects:

- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-crc-CM4
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-crc-CM7
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-crc
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Benchmark-crc

#### 1.2.3.1. Conclusions

The benchmark results indicate that the HW accelerated CRC performance of the 40 nm devices is slightly decreased in comparison to the 90 nm devices.

HW accelerated CRC initiated from the lower-power 200 MHz CM4 MCU provides performance comparable to the 400 MHz CM7 device.




### 1.2.4. HASH

This benchmark compares performance of the hash processor HW. It is a fully compliant implementation of

- the SHA (secure hash algorithm),
- the MD5 (message-digest algorithm 5) hash algorithm
- the HMAC (keyed-hash message authentication code) algorithm

suitable for a variety of applications. It computes a message digest:

- 160 bits for the SHA-1 algorithm,
- 256 bits for the SHA-256 algorithm,
- 224 bits for the SHA-224 algorithm,
- 128 bits for the MD5 algorithm

for messages of up to (2<sup>64</sup> - 1) bits. HMAC algorithms provide a way of authenticating messages by means of hash functions. HMAC algorithms consist in calling the SHA-1, SHA-224, SHA-256 or MD5 hash function twice.
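The "called twice" structure of HMAC can be written out explicitly. The formula below follows the standard HMAC definition (RFC 2104); the symbols are introduced here for illustration and do not appear in the original note:

$$\mathrm{HMAC}(K, m) = H\big((K' \oplus \mathit{opad}) \,\|\, H\big((K' \oplus \mathit{ipad}) \,\|\, m\big)\big)$$

Here $H$ is the selected hash function (SHA-1, SHA-224, SHA-256 or MD5), $K'$ is the key $K$ padded with zeros to the block size of $H$ (or first hashed if it is longer than one block), $\|$ denotes concatenation, and $\mathit{ipad}$ and $\mathit{opad}$ are the bytes 0x36 and 0x5C repeated to fill one block. The inner call hashes the message; the outer call hashes the inner digest.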



Figure 9: Benchmark results for HW accelerated HASH

Created SW4STM32 benchmark projects:

- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-hash-CM4
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-hash-CM7
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-hash
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Benchmark-hash

#### 1.2.4.1. Conclusions

The HW accelerated HASH performance of the 40 nm devices is similar to that of the 90 nm device.

HW accelerated HASH initiated from the lower-power 200 MHz CM4 MCU provides performance comparable to the 400 MHz CM7 devices.



### 1.2.5. RANDOM Data

This benchmark compares the performance of the random number generator (RNG). The RNG is based on continuous analogue noise and provides a random 32-bit value to the host each time it is read.



Figure 10: Benchmark results for generator of random numbers

Created SW4STM32 benchmark projects:

- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-rand-CM4
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-rand-CM7
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-rand
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Benchmark-rand

#### 1.2.5.1. Conclusions

The performance of the random number generator on the 40 nm devices is significantly increased in comparison to the 90 nm devices.

The performance of the random number generator on the lower-power 200 MHz CM4 MCU is comparable to that of the 216 MHz CM7 on the 90 nm device.




### 1.2.6. JPEG

This benchmark demonstrates the performance of the HW JPEG decoder of the STM32H757 and STM32H753 CM7 MCUs when decoding a JPEG file.

The benchmark reads a JPEG file from the SD card using FatFs, decodes it using the JPEG HW decoder in DMA mode, and displays the final ARGB8888 image on the LCD mounted on the evaluation board.



Figure 11: Performance of the HW accelerated JPEG decoding and encoding

Used STM32Cube projects:

- STM32Cube\_FW\_H7\_V1.5.0\STM32H747I-EVAL\Examples\JPEG\Benchmark-JPEG\_DecodingUsingFs\_DMA
- STM32Cube\_FW\_H7\_V1.5.0\STM32H747I-EVAL\Examples\JPEG\Benchmark-JPEG\_EncodingUsingFs\_DMA
- STM32Cube\_FW\_H7\_V1.5.0\STM32H743I-EVAL\Examples\JPEG\Benchmark-JPEG\_DecodingUsingFs\_DMA
- STM32Cube\_FW\_H7\_V1.5.0\STM32H743I-EVAL\Examples\JPEG\Benchmark-JPEG\_EncodingUsingFs\_DMA
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Applications\LibJPEG\Benchmark-LibJPEG\_Decoding
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Applications\LibJPEG\Benchmark-LibJPEG\_Encoding

#### 1.2.6.1. Conclusions

The performance of JPEG decoding on the 40 nm devices is significantly increased in comparison to the 90 nm devices due to introduction of the HW acceleration in the STM32H7 family of 40 nm devices.

The difference in the maximal FPS for STM32H757 and STM32H753 CM7 MCUs is related to different video frame sizes ([800x600] and [640x480]).




### 1.2.7. MJPEG

This benchmark demonstrates performance of the HW JPEG decoder of STM32H757 and STM32H753 CM7 MCU to decode an MJPEG video file located on the uSD card.

When the first frame is decoded, a HAL routine is called to retrieve the image parameters:

- image width,
- image height,
- image quality (from 1% to 100%),
- color space,
- chroma sampling.

These parameters are used to initialize the DMA2D. The DMA2D performs the copy of the decoded frame to the display frame buffer together with the YCbCr to RGB conversion necessary for the display on the RGB LCD.
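The arithmetic behind a YCbCr to RGB conversion can be sketched per pixel as follows. In the benchmark this step is performed by the DMA2D HW, not by the CPU. The sketch assumes the full-range JFIF coefficients commonly used with JPEG; the note does not state which coefficients the DMA2D applies, and the function names are hypothetical.

```c
#include <stdint.h>

/* Rounds to nearest and saturates to the 8-bit range 0..255. */
static uint8_t clamp_u8(float v)
{
    if (v < 0.0f) {
        return 0;
    }
    if (v > 255.0f) {
        return 255;
    }
    return (uint8_t)(v + 0.5f);
}

/* Hypothetical sketch: converts one full-range YCbCr pixel to RGB using
 * the JFIF coefficients (assumed here for illustration). */
void ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr,
                  uint8_t *r, uint8_t *g, uint8_t *b)
{
    float d = (float)cb - 128.0f;  /* blue-difference chroma, centred */
    float e = (float)cr - 128.0f;  /* red-difference chroma, centred */

    *r = clamp_u8((float)y + 1.402f * e);
    *g = clamp_u8((float)y - 0.344136f * d - 0.714136f * e);
    *b = clamp_u8((float)y + 1.772f * d);
}
```

With neutral chroma (Cb = Cr = 128) the output is a grey level equal to Y. Offloading this per-pixel work to the DMA2D leaves the CPU free while frames are copied to the display frame buffer.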



Figure 12: Performance of the HW accelerated MJPEG decoding

Used STM32Cube\_FW\_H7\_V1.5.0 projects:

- STM32Cube\_FW\_H7\_V1.5.0\STM32H747I-EVAL\Examples\JPEG\Benchmark-JPEG\_MJPEG\_VideoDecoding
- STM32Cube\_FW\_H7\_V1.5.0\STM32H743I-EVAL\Examples\JPEG\Benchmark-JPEG\_MJPEG\_VideoDecoding

For the STM32F756, the STM32Cube demonstration platform was used to benchmark the MJPEG decoding performance. The video player module provides a video solution based on the STM32H7xxx and the STemWin movie APIs. It supports playing movies in AVI format with a fixed resolution of 640x480 pixels.

See:

STM32Cube\_FW\_H7\_V1.5.0\Projects\STM32H743I-EVAL\Demonstrations\STemWin

#### 1.2.7.1. Conclusions

The performance of MJPEG decoding on the 40 nm devices is significantly increased in comparison to the 90 nm devices due to the introduction of the HW accelerated JPEG decoder in the STM32H757 and STM32H753 CM7 MCUs.

The difference in the maximal FPS for STM32H757 and STM32H753 CM7 MCUs is related to different video frame sizes ([800x600] and [640x480]).


### 1.2.8. SSL Client/Server

This benchmark compares the performance of the 400 MHz STM32H753 40 nm CM7 device with the 216 MHz STM32F756 90 nm CM7 device. Both devices have cryptographic HW accelerators.

The two applications used for the comparison run on top of the STM32Cube HAL drivers, the PolarSSL library (since renamed mbedTLS) and the LwIP SW stack in RTOS mode:

- SSL\_Client: This part of the benchmark proves the ability of the STM32H753 and STM32F756 devices to exchange messages with a server over TCP/IP connectivity through an SSL connection. This application allows the user to connect the STM32F756-Eval board and the STM32H743-Eval board to a secure web server using the SSL protocol.
- SSL\_Server: This part of the benchmark combines HTTP with the SSL protocol to provide encryption and secure identification of the server. This application allows the user to connect from a web browser to the STM32F756-Eval board and the STM32H743-Eval board using the SSL protocol.

The Secure Socket Layer (SSL) and Transport Layer Security (TLS) protocols provide communications security over the Internet and allow client/server applications to communicate in a way that is private and reliable. These protocols are layered above a transport protocol such as TCP/IP. SSL is the standard security technology for creating an encrypted link between server and client. This link ensures that all communication data remains private and secure.

The benchmark compares the STM32F756-Eval board with the STM32F756 90 nm MCU and the STM32H743-Eval board with the STM32H753 40 nm MCU.

Both compared MCUs have a hardware cryptographic processor that supports AES-128/192/256, Triple DES, DES, SHA-1, SHA-2, MD5 and RNG.

Both evaluation boards support an embedded 100 Mbit/s Ethernet MAC.

Performance is also enhanced through the use of a dedicated DMA controller and hardware checksums for the IP, UDP, TCP and ICMP protocols.



Figure 13: SSL Client/Server Ethernet benchmark



Used STM32Cube\_FW\_H7\_V1.5.0 and STM32Cube\_FW\_F7\_V1.15.0 projects:

- STM32Cube\_FW\_H7\_V1.5.0\Projects\STM32H743I-EVAL\Applications\mbedTLS\SSL\_Client
- STM32Cube\_FW\_H7\_V1.5.0\Projects\STM32H743I-EVAL\Applications\mbedTLS\SSL\_Server
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Applications\mbedTLS\SSL\_Client
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Applications\mbedTLS\SSL\_Server

#### 1.2.8.1. Conclusions

The SSL Client Server Ethernet performance for the 400 MHz 40 nm device is slightly increased in comparison to the reference 216 MHz 90 nm device.

This corresponds to faster execution of SW parts of the communication, while the performance of the hardware cryptographic processor is similar for both families of devices.



© 2019 ÚTIA AV ČR, v.v.i. All disclosure and/or reproduction rights reserved.

## 1.3. Scilab benchmark with NUCLEO-H755ZI as Terminal

### 1.3.1. Matrix multiplication in SciLab on ArduZynq in A9 MPU and in FP01x8 Accelerator

This benchmark runs on the industrial 40 nm UTIA demonstrator described below. The benchmark computes the floating point product of square matrices of size [64x64]. The SW code unrolls the inner loops to make the best use of the HW floating point unit.



Figure 14: Matrix multiplication on ArduZynq with Scilab and FP01x8 HW accelerator

Created SW4STM32 benchmark projects:

- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-scilab
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-fpu-mult-CM4
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H755ZI-Q\Benchmark-fpu-mult-CM7
- STM32Cube\_FW\_H7\_V1.5.0\Projects\NUCLEO-H753ZI\Benchmark-fpu-mult
- STM32Cube\_FW\_F7\_V1.15.0\Projects\STM32756G\_EVAL\Benchmark-fpu-mult

#### 1.3.1.1. Conclusions

The single-precision MFLOP/s performance of C functions executed from Scilab on the 650 MHz A9 MPU is slightly below the performance of the 400 MHz CM7 MCU. This is due to the overhead of copying the input matrices by value from Scilab to the C MEX function and of copying the result of the matrix multiplication by value from the C MEX function back to the Scilab interpreter.

The matrix multiplication implemented as a direct call of a C function is faster on the A9 MPU, because the input and output matrices are referenced by pointers.

The significant increase in single-precision MFLOP/s performance in the SIMD FP01x8 accelerator comes from the 8xSIMD parallel execution of vector operations. The FP01x8 accelerator is programmed by user-defined firmware optimised for the computation of the [64x64] matrix multiplication.




The accelerator runs at 115 MHz. It uses autonomous HW DMA data transfers from/to DDR3 buffers allocated as contiguous regions of memory isolated from the Linux memory management. The FP01x8 accelerator supports overlapping the SIMD single-precision floating point computation with the HW supported data communication.

The FP01x8 accelerator is described in Application note and evaluation package "Evaluation version of 8xSIMD FP01x8 accelerator for ArduZynq shield". See:

http://sp.utia.cz/index.php?ids=results&id=te0723\_fp01x8

The terminal uses the NUCLEO-H755ZI board with two MCUs, the 400 MHz CM7 and the 200 MHz CM4, on a 40 nm CMOS device with 2 MB eFLASH.

The NUCLEO-H755ZI board controls the Adafruit 1.8" colour TFT display V2 with joystick. The resolution of the display is 160x128 pixels.

NUCLEO-H755ZI board also supports serial communication with the ArduZynq shield. The ArduZynq shield works with Xilinx 28 nm Zynq device XC7Z010-1C with dual-core Arm Cortex A9 MPU running at 650 MHz and 512 Mbyte DDR3.



Figure 15: NUCLEO-H755ZI terminal with menu control of Scilab algorithms on ArduZynq

UTIA supports the Xilinx SDSoC 2018.2 compiler for the ArduZynq shield. The compiler serves for generation of HW accelerators with data movers (DMA or SG-DMA) from user defined C/C++ functions.

The programmable logic part of the Zynq device is configured with SIMD FP01x8 run-time reprogrammable single-precision floating point HW accelerator IP core.

The SIMD FP01x8 accelerator can be re-programmed in runtime by change of firmware. The firmware can be compiled directly on the A9 processor on the ArduZynq shield. Xilinx SDSoC 2018.2 compiler is not needed for this run-time re-compilation and reconfiguration of the firmware.

The dual core A9 processor on the ArduZynq shield is running Debian OS with preconfigured Petalinux 2018.2 kernel.




The ArduZynq shield runs the Scilab command-line interpreter as a Debian OS user space application. Scilab supports interpretation of double-precision scripts and functions working with double-precision matrix data.

The Scilab interpreter can also call and execute compiled C functions written in the Matlab MEX format. These functions can be compiled directly on the ArduZynq shield by the C/C++ compiler that is part of the Debian OS.

The NUCLEO-H755ZI board serves as a serial terminal with a menu-based GUI for selecting the Scilab demos to be executed on the ArduZynq shield.

Details on the assembly of the terminal with the ArduZynq shield are described in the UTIA application note "Industrial 40 nm Demonstrator NUCLEO-STM32H755ZI-Q". This application note can be downloaded together with the related evaluation package from: <u>http://sp.utia.cz/index.php?ids=results&id=H755ZI-Q</u>




## DISCLAIMER

This disclaimer is not a license and does not grant any rights to the materials distributed herewith. Except as otherwise provided in a valid license issued to you by UTIA AV CR v.v.i., and to the maximum extent permitted by applicable law:

(1) THIS APPLICATION NOTE AND RELATED MATERIALS LISTED IN THIS PACKAGE CONTENT ARE MADE AVAILABLE "AS IS" AND WITH ALL FAULTS, AND UTIA AV CR V.V.I. HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY PARTICULAR PURPOSE; and

(2) UTIA AV CR v.v.i. shall not be liable (whether in contract or tort, including negligence, or under any other theory of liability) for any loss or damage of any kind or nature related to, arising under or in connection with these materials, including for any direct, or any indirect, special, incidental, or consequential loss or damage (including loss of data, profits, goodwill, or any type of loss or damage suffered as a result of any action brought by a third party) even if such damage or loss was reasonably foreseeable or UTIA AV CR v.v.i. had been advised of the possibility of the same.

#### **Critical Applications:**

UTIA AV CR v.v.i. products are not designed or intended to be fail-safe, or for use in any application requiring fail-safe performance, such as life-support or safety devices or systems, Class III medical devices, nuclear facilities, applications related to the deployment of airbags, or any other applications that could lead to death, personal injury, or severe property or environmental damage (individually and collectively, "Critical Applications"). Customer assumes the sole risk and liability of any use of UTIA AV CR v.v.i. products in Critical Applications, subject only to applicable laws and regulations governing limitations on product liability.
