



Hans Volkers, Jens Benndorf

### **ASIP-Case Studies II(a):**

Heterogeneous Multicore Architecture for Image Sensor Processing featuring Tensilica Cores

Tensilica Day 9.2.2016 @ IMS-Hannover



# **DCT Company Profile**

# Dream Chip Technologies ...


- Design Service Company especially for European customers in the SoC and embedded SW market.
  - Hardware and software solutions for real time imaging applications
  - Embedded software on various platforms
  - Concept engineering
  - Cadence/Tensilica Design Center Partner since 2011
  - Tensilica Designs since 2005
- The CODESIGN Experts




# The Things2Do-Project



- Things2Do: THIN but Great Silicon 2 Design Objects
- Schedule: 1 April 2014 to 30 September 2018
- THINGS2DO is an ENIAC project addressing semiconductor energy efficiency and design & development ecosystems for FD-SOI-technology
- More than 50 companies, institutes and universities from 12 countries are addressing different applications for 22/28nm FDSOI technology
- The Dream Chip contribution
  - **Dream Chip's** part is to create a complex SoC design for camera-based ADAS applications
  - LUH IMS's part is the reference software on the heterogeneous SoC design from DCT
  - Cadence supplies EDA tools and IP infrastructure to the project; GLOBALFOUNDRIES supplies the 22nm FD-SOI technology and manufacturing
  - Partners:







### Advanced Driver Assistance Systems (ADAS) overview



Source: autobild.de



# Assisted Driving requires Cameras, Radar and Ultrasonic



Source: Auto-Medienportal.net




# Use Case #1: Digital Mirroring







# Use Case #1: Digital Mirroring - The Multiview Idea

Automotive multi-camera systems for Bird-View, Rear-View and Panorama-View are a major part of today's emerging technologies to make driving safer and more comfortable and to move towards autonomous vehicles.





# Use Case #2: 360° Top View Camera






# **TopView Harmonization**

[Figure: screenshot of the TopView harmonization demo. The on-screen key menu lists: H show help; Q quit this program; X harmonization bypass; F use filtered means; D display measurement window; T use temporal low-pass filter for gains; W use difference-dependent weight for gains; B force bottom gains to 1.0; C use smooth transition between edges; G display gains on gray image; S create screenshot. For the upper-left, upper-right, lower-left and lower-right corners, the tool displays the measured average R/G/B values and the resulting per-channel gains and weights. An accompanying diagram shows the mounting positions of the cameras and the coverage areas of the cameras.]

Dream Chip Technologies GmbH



# Introduction and Classification of Image Sensor Processing (ISP)





# Introduction – Image Sensor Processing Overview





# **Heterogeneous Cores for Image Sensor Processing**






# ISP Algorithms



# Image Sensor Processing – Meta Pipeline





# **Planes and Task Types**

Data plane algorithms can be mapped onto two different levels, each with its own data type and data parallelism requirements.





# **Typical Image Sensor Processing Workload**

- Small Effort Functions: ~10 Operations per Pixel
  - e.g. Gamma Correction, LSC, White Balance Gain, CSC, Cropping
- Medium Effort Functions: ~100 Operations per Pixel
  - e.g. Debayering, CAC, small Gauss and Median Filter, Defect Pixel Correction
- High Effort Functions: ~1000 Operations per Pixel
  - e.g. Med. Bilateral Filtering, Motion Estimation, 3DNR, Encoding/Decoding
- "Typical" ISP Pipeline: ~4k Operations per Pixel
- Let's assume 4 sensors @ 1080p30: ~240 Megapixels per second => ~1000 Giga Operations per second are required

#### Typical workload is beyond pure SW based embedded computing
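The workload estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the 1920×1080 frame geometry and the constant names are assumptions for illustration, since the slide states just "1080p30" and "~4k operations per pixel".

```python
# Back-of-the-envelope ISP workload, using the figures quoted on the slide.
SENSORS = 4
WIDTH, HEIGHT = 1920, 1080   # assumed 1080p frame geometry
FPS = 30
OPS_PER_PIXEL = 4_000        # "typical" ISP pipeline, ~4k operations per pixel

pixels_per_second = SENSORS * WIDTH * HEIGHT * FPS
ops_per_second = pixels_per_second * OPS_PER_PIXEL

print(f"{pixels_per_second / 1e6:.0f} MPixel/s")  # about 249 MPixel/s
print(f"{ops_per_second / 1e9:.0f} GOPS")         # about 995 GOPS
```

The exact values (about 249 MPixel/s and 995 GOPS) match the rounded slide figures of roughly 240 MPixel/s and 1000 GOPS.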




# Architecture Proposal



# **Tensilica Core Customization – Explanation by 2 Corner Cases**

- Small Core (e.g. mini108)
  - ~1 data plane operation per core cycle
  - 32-bit load/store (L/S) per core cycle
  - Base instructions
- Big Core (e.g. IVP / Vision P5)
  - ~100 pixel operations per core cycle
  - 1024-bit load/store (L/S) per core cycle
  - Custom instructions
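To put the two corner cases in proportion, the sketch below estimates how many cores of each kind a workload of about 1000 GOPS would need, and the load/store bandwidth each core offers. The 1 GHz clock and the function names are assumptions for illustration; the per-cycle figures are the ones listed above.

```python
import math

REQUIRED_GOPS = 1000   # workload estimate for 4 sensors @ 1080p30
CLOCK_GHZ = 1.0        # assumed core clock, not stated on this slide

# Operations and load/store bits per core cycle, as listed on the slide.
CORES = {
    "small (e.g. mini108)": {"ops_per_cycle": 1, "ls_bits": 32},
    "big (e.g. IVP / Vision P5)": {"ops_per_cycle": 100, "ls_bits": 1024},
}

def cores_needed(ops_per_cycle, clock_ghz=CLOCK_GHZ, required_gops=REQUIRED_GOPS):
    """Number of cores needed if every core sustains its per-cycle rate."""
    return math.ceil(required_gops / (ops_per_cycle * clock_ghz))

def ls_bandwidth_gbytes(ls_bits, clock_ghz=CLOCK_GHZ):
    """Load/store bandwidth in GByte/s for one core."""
    return ls_bits / 8 * clock_ghz

for name, spec in CORES.items():
    n = cores_needed(spec["ops_per_cycle"])
    bw = ls_bandwidth_gbytes(spec["ls_bits"])
    print(f"{name}: {n} cores, {bw:.0f} GByte/s L/S per core")
```

Under these assumptions the workload would need on the order of 1000 small cores or 10 big cores, which illustrates why customization for data parallelism matters.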







# **Codesign Approach**



- Divide the data plane SW into functions on the "Vectorizable Pixel Level" and the "Object Level"
  - Use customized/optimized cores for the pixel level and for the object level
  - Multiple cores per class operate in parallel on pipeline stages and image segments
- Use HW engines for base computations that do not change (e.g. filters, transformations, compression standards)
  - Use the flexibility of the data plane SW to "glue" the HW results together
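The split into a pixel level that runs in parallel on image segments and an object level that works on the whole frame can be sketched in software. This is a toy model only: threads stand in for cores, and the stage functions `gain_stage` and `brightest_pixel` are hypothetical placeholders, not part of the architecture described here.

```python
from concurrent.futures import ThreadPoolExecutor

def split_rows(image, n_segments):
    """Split a non-empty list of rows into contiguous, roughly equal segments."""
    step = -(-len(image) // n_segments)  # ceiling division
    return [image[i:i + step] for i in range(0, len(image), step)]

def gain_stage(segment, gain=2):
    """Hypothetical pixel-level stage: per-pixel gain with 8-bit clipping."""
    return [[min(255, p * gain) for p in row] for row in segment]

def brightest_pixel(image):
    """Hypothetical object-level stage: reduce the frame to a single result."""
    return max(max(row) for row in image)

def process_frame(image, n_cores=4):
    # Pixel level: each "core" (here a thread) handles one image segment.
    segments = split_rows(image, n_cores)
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        processed = list(pool.map(gain_stage, segments))
    # Glue the segment results back together, then run the object level.
    frame = [row for seg in processed for row in seg]
    return frame, brightest_pixel(frame)

frame, peak = process_frame([[10, 20], [30, 40], [100, 200], [5, 6]], n_cores=2)
print(frame, peak)
```

In a real system the pixel-level stage would run on vector cores or HW engines and the object level on a general-purpose core; the sketch only shows the data flow.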



# **Optimized Cores for Pixel Level and Object Level**

#### Customized Core for Pixel Level

- Tensilica/Cadence Vision P5: Imaging Video Processor
- Predefined customization available ("Tensilica IVP")
- 4-issue VLIW, 32-way SIMD custom instruction set
- Approx. 100 pixel level operations per clock
- Programmable in C code with intrinsics

#### Pre-Optimized Core for the Object Level

- ARM Cortex-A17 Quad Core
- Automotive standard core for image analysis
- Well suited out of the box for multitasking 32-bit object processing
  - Quad core, FPU support, short vectors by NEON
  - Approx. 10 object level operations per clock
- ADAS software libraries available





# Dream Chip Architecture Proposal with Details






### Effort versus Performance Matrix (22nm, FDSOI)

|                             |    | Area<br>(total) | Area<br>(PHY)      | Area<br>(Logic)    | Area<br>(Mem)      | Gates    | Memory   | Clock | Technology | Performance | Power<br>(total) | Power<br>(PHY) | Power<br>(Logic) | Power<br>(Mem) |
|-----------------------------|----|-----------------|--------------------|--------------------|--------------------|----------|----------|-------|----------------|------------------|------------------|----------------|------------------|----------------|
| Area = 38.4 mm2             |    | 38.4            |                    |                    |                    |          |          |       |                | 2.8              | 4                |                |                  |                |
| Power = 4 Watt              |    | 30.4            |                    |                    |                    |          |          |       |                | 2.0              | 4                |                |                  |                |
| Instances                   | #  | [mm<sup>2</sup>] | [mm<sup>2</sup>] | [mm<sup>2</sup>] | [mm<sup>2</sup>] | [kGates] | [KBytes] | [GHz] |                | [GOPS] (totals in TOPS) | [W]              | [W]            | [W]              | [W]            |
| CPU Cores                   |    |                 |                    |                    |                    |          |          |       |                |                  |                  |                |                  |                |
| IVP_EP (100 op/cycle)       | 1  | 4.8             | (n.a.)             | 0.5                | 4.3                | 1000     | 2172     | 1     | LVT            | 100              | 0.55             | (n.a.)         | 0.31             | 0.24           |
| IVP_EP (100 op/cycle)       | 1  | 1.8             | (n.a.)             | 0.5                | 1.3                | 1000     | 640      | 1     | LVT            | 100              | 0.39             | (n.a.)         | 0.31             | 0.07           |
| IOP ARM A17 (10 op/cycle)   | 1  | 3.0             | (n.a.)             | 2.0                | 1.0                | (n.a.)   | 512      | 1     | LVT            | 10               | 0.30             | (n.a.)         | 0.20             | 0.10           |
| SP (2xLX6) (2 op/cycle)     | 2  | 0.6             | (n.a.)             | 0.1                | 0.3                | 100      | 128      | 0.5   | RVT            | 2                | 0.04             | (n.a.)         | 0.01             | 0.01           |
| AXI Interconnect            | 1  | 0.5             | (n.a.)             | 0.5                | 0.0                | 1000     | 0        | 0.5   | RVT            | (n.a.)           | 0.12             | (n.a.)         | 0.12             | -              |
| HW Processing               |    |                 |                    |                    |                    |          |          |       |                |                  |                  |                |                  |                |
| Engines 1 (1k op/cycle)     | 4  | 6.4             | (n.a.)             | 0.8                | 0.9                | 1500     | 430      | 0.15  | RVT            | 600              | 0.29             | (n.a.)         | 0.06             | 0.01           |
| Engines 2 (1k op/cycle)     | 1  | 1.3             | (n.a.)             | 0.3                | 1.0                | 500      | 512      | 0.5   | RVT            | 500              | 0.09             | (n.a.)         | 0.06             | 0.03           |
| Engines 3 (0.5k op/cycle)   | 1  | 0.9             | (n.a.)             | 0.5                | 0.4                | 1000     | 192      | 0.5   | RVT            | 500              | 0.14             | (n.a.)         | 0.12             | 0.01           |
| Engines 4 (0.5k op/cycle)   | 4  | 0.3             | (n.a.)             | 0.1                | 0.0                | 100      | 9        | 0.5   | RVT            | 1000             | 0.05             | (n.a.)         | 0.01             | 0.00           |
| System Memory               |    |                 |                    |                    |                    |          |          |       |                |                  |                  |                |                  |                |
| DMA                         | 1  | 0.2             | (n.a.)             | 0.1                | 0.1                | 200      | 48       | 0.5   | RVT            | (n.a.)           | 0.03             | (n.a.)         | 0.02             | 0.00           |
| Local Memory (L3)           | 2  | 1.0             | (n.a.)             | 0.0                | 0.5                | 10       | 256      | 0.5   | RVT            | (n.a.)           | 0.03             | (n.a.)         | 0.00             | 0.02           |
| DDR                         | 2  | 5.3             | 2.4                | 0.1                | 0.1                | 200      | 64       | 0.5   | RVT            | (n.a.)           | 0.66             | 0.3            | 0.02             | 0.00           |
| Sensor Interface            |    |                 |                    |                    |                    |          |          |       |                |                  |                  |                |                  |                |
| MIPI-RX                     | 4  | 2.0             | 0.5                | 0.0                | 0.0                | 0        | 0        | 0.5   | RVT            | (n.a.)           | 0.08             | 0.02           | 0.00             | -              |
| System Interfaces           |    |                 |                    |                    |                    |          |          |       |                |                  |                  |                |                  |                |
| PCIe                        | 2  | 3.3             | 1.5                | 0.1                | 0.1                | 160      | 32       | 0.5   | RVT            | (n.a.)           | 0.54             | 0.25           | 0.02             | 0.00           |
| GBE                         | 1  | 0.1             | ext                | 0.1                | 0.0                | 200      | 16       | 0.5   | RVT            | (n.a.)           | 0.03             | ext            | 0.02             | 0.00           |
| USB3                        | 1  | 1.5             | 1.2                | 0.1                | 0.3                | 160      | 128      | 0.5   | RVT            | (n.a.)           | 0.20             | 0.17           | 0.02             | 0.01           |
| MIPI-TX (Display)           | 1  | 1.2             | 1.1                | 0.0                | 0.0                | 25       | 8        | 0.5   | RVT            | (n.a.)           | 0.04             | 0.04           | 0.00             | 0.00           |
| SPI-M                       | 6  | 0.2             | (n.a.)             | 0.0                | 0.0                | 25       | 8        | 0.5   | RVT            | (n.a.)           | 0.02             | (n.a.)         | 0.00             | 0.00           |
| others                      | 1  | 0.0             | (n.a.)             | 0.0                | 0.0                | 25       | 8        | 0.5   | RVT            | (n.a.)           | 0.00             | (n.a.)         | 0.00             | 0.00           |
| System                      |    |                 |                    |                    |                    |          |          |       |                |                  |                  |                |                  |                |
| System controller / toplevel | 1  | 0.1             | (n.a.)             | 0.1                | 0.0                | 100      | 0        | 0.5   | RVT            | (n.a.)           | 0.01             | (n.a.)         | 0.01             | -              |
| PLL                         | 4  | 0.2             | 0.04               | 0.0                | 0.0                | 0        | 0        | 1     | (n.a)          | (n.a.)           | 0.02             | 0.005          | 0.00             | -              |
| Area IOs                    | ## | 2.3             | (n.a.)             | 0.007              | 0.0                | (n.a)    | 0        | 0     | (n.a)          | (n.a.)           | 0.01             | (n.a.)         | 0.00             | -              |
| Voltage Regulators          | 3  | 1.5             | 0.5                | 0.0                | 0.0                | 0        | 0        | 0     | (n.a)          | (n.a.)           | 0.30             | 0.1            | 0.00             | -              |
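The totals in the first rows of the table can be cross-checked by summing the per-block values. The sketch below copies the "Area (total)", "Power (total)" and "Performance" columns from the table; the short block names are abbreviations chosen here. It assumes the per-block performance values are in GOPS (for example 100 op/cycle at 1 GHz), so that their sum corresponds to the 2.8 quoted as the total in TOPS.

```python
# (block, total area [mm^2], total power [W], performance or None), from the table
ROWS = [
    ("IVP_EP (2172 KB)", 4.8, 0.55, 100), ("IVP_EP (640 KB)", 1.8, 0.39, 100),
    ("ARM A17", 3.0, 0.30, 10), ("SP (2xLX6)", 0.6, 0.04, 2),
    ("AXI Interconnect", 0.5, 0.12, None),
    ("Engines 1", 6.4, 0.29, 600), ("Engines 2", 1.3, 0.09, 500),
    ("Engines 3", 0.9, 0.14, 500), ("Engines 4", 0.3, 0.05, 1000),
    ("DMA", 0.2, 0.03, None), ("Local Memory (L3)", 1.0, 0.03, None),
    ("DDR", 5.3, 0.66, None), ("MIPI-RX", 2.0, 0.08, None),
    ("PCIe", 3.3, 0.54, None), ("GBE", 0.1, 0.03, None),
    ("USB3", 1.5, 0.20, None), ("MIPI-TX", 1.2, 0.04, None),
    ("SPI-M", 0.2, 0.02, None), ("others", 0.0, 0.00, None),
    ("System controller", 0.1, 0.01, None), ("PLL", 0.2, 0.02, None),
    ("Area IOs", 2.3, 0.01, None), ("Voltage Regulators", 1.5, 0.30, None),
]

total_area = sum(area for _, area, _, _ in ROWS)
total_power = sum(power for _, _, power, _ in ROWS)
total_perf = sum(perf for _, _, _, perf in ROWS if perf is not None)

print(f"area  = {total_area:.1f} mm^2")
print(f"power = {total_power:.2f} W")
print(f"perf  = {total_perf} (about {total_perf / 1000:.1f} TOPS if rows are GOPS)")
```

The sums come out at about 38.5 mm², 3.94 W and 2812, which agrees with the quoted 38.4 mm², 4 W and 2.8 within rounding of the individual entries.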

- Average area figures @ 22nm:
  - ~2 MGates/mm<sup>2</sup>
  - ~0.5 MByte/mm<sup>2</sup> (large SP-SRAMs)
- Average power figures @ 1 GHz:
  - ~500 mW for 1 mm<sup>2</sup> logic
  - ~100 mW for 1 MByte
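These rules of thumb are enough for a first-order area and power estimate of a block. The sketch below is a hedged illustration, not the authors' estimation method: the function `estimate` is hypothetical, 1 MByte is taken as 1024 KBytes, and power is assumed to scale linearly with clock frequency relative to the 1 GHz reference.

```python
GATES_PER_MM2 = 2e6      # ~2 MGates/mm^2 at 22nm
MBYTE_PER_MM2 = 0.5      # ~0.5 MByte/mm^2 (large SP-SRAMs)
LOGIC_W_PER_MM2 = 0.5    # ~500 mW per mm^2 logic at 1 GHz
MEM_W_PER_MBYTE = 0.1    # ~100 mW per MByte at 1 GHz

def estimate(kgates, kbytes, clock_ghz=1.0):
    """First-order area [mm^2] and power [W] estimate for one block."""
    mbytes = kbytes / 1024
    logic_area = kgates * 1e3 / GATES_PER_MM2
    mem_area = mbytes / MBYTE_PER_MM2
    # Assumption: power scales linearly with clock frequency.
    logic_power = logic_area * LOGIC_W_PER_MM2 * clock_ghz
    mem_power = mbytes * MEM_W_PER_MBYTE * clock_ghz
    return {
        "logic_area": logic_area,
        "mem_area": mem_area,
        "logic_power": logic_power,
        "mem_power": mem_power,
    }

# Example: the first IVP_EP row of the table (1000 kGates, 2172 KBytes, 1 GHz).
ivp = estimate(kgates=1000, kbytes=2172)
print({k: round(v, 2) for k, v in ivp.items()})
```

For this example the estimate gives 0.5 mm² of logic and roughly 4.2 mm² of memory, close to the 0.5 and 4.3 mm² listed in the table; the power estimates (about 0.25 W and 0.21 W) are of the same order as the listed 0.31 W and 0.24 W.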



# **System Scalability**

- Chip-to-chip interconnect via dual PCIe ports, e.g. for a ring interconnect
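A ring built from two ports per chip can be modelled in a few lines. The sketch assumes that each chip uses one PCIe port for its predecessor and one for its successor in the ring; the function names are hypothetical and the slide does not specify routing.

```python
def ring_neighbours(n_chips):
    """Map each chip index to its (previous, next) neighbour in the ring."""
    return {i: ((i - 1) % n_chips, (i + 1) % n_chips) for i in range(n_chips)}

def ring_hops(src, dst, n_chips):
    """Minimum number of chip-to-chip hops, going either way round the ring."""
    d = (dst - src) % n_chips
    return min(d, n_chips - d)

print(ring_neighbours(4))
print(ring_hops(0, 3, 4), ring_hops(0, 2, 4))
```

With four chips, every chip reaches any other in at most two hops, which is the usual trade-off of a ring: only two ports per chip, at the cost of a hop count that grows with the number of chips.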






# Application Examples



# **ADAS System Study "Smart Rearview Mirror"**



- System partitioning depending on customer requirements
  - Example: http://www.bmwblog.com/2016/01/05/bmw-i8-shows-mirrorless-camera-technology/



# Things2Do Timeline

| 2014                         | 2015 | 2016       | 2017       | 2018      |
|------------------------------|------|------------|------------|-----------|
| FDSOI Technology Development |      |            |            |           |
| 28nm                         |      | 22/14nm    |            |           |
|                              |      | Things2Do  |            |           |
|                              |      | Tape Out 1 | Tape Out 2 |           |
|                              |      |            | Prototype  | Prototype |



# Summary

- Introduction to the Things2Do project and DCT's contribution
- Introduction to Image Sensor Processing
  - Image Processing, Image Analysis, Computer Vision and Computer Graphics
  - => Requirements for heterogeneous computing
- ISP Algorithms
  - Meta Pipeline, Base Functions, Quantitative Analysis
  - => Requirements for base functions in HW
- ISP Architecture
  - Suggestion for a "Trinity" on the data plane:
  - => SW for Object Level, SW for Pixel Level and HW Engines for "base functions"
- ISP Applications
  - Examples with Application-to-Architecture Mappings









# Thank You

Please contact info@dreamchip.de



Dream Chip Technologies GmbH, Steinriede 10, D-30827 Garbsen/Hannover, Germany, +49-5131-90805-0