

Tensilica Day

A Flexible ASIP Architecture for Connected Components Labeling: Implementation, Lessons Learned, and Integration into Novel Design Tools

Juan Fernando Eusse and Rainer Leupers

Hannover, February 9<sup>th</sup>, 2016





1 Connected Components Labeling ASIP

2 Methodological Observations

3 Pre-Architectural Performance Estimation





# Connected Components Labeling (CCL)

- Detect connected regions of pixels
  - Single-pass algorithm
    - One iteration over the input frame
    - Requires additional data structures (memory)
  - Rasterized processing of individual pixels
  - Uses an 8-nearest-neighbor mask
  - Collects region characteristics on the fly (merged at the end)
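
The bullets above can be illustrated with a small software model. This is a minimal sketch of a generic single-pass, 8-connectivity labeling scheme, not the ASIP's datapath: the function name `label_frame`, the union-find style equivalence table, and the choice of area plus bounding box as region features are all assumptions made for illustration.

```python
def label_frame(frame):
    """Label a binary frame (list of rows of 0/1) in one raster pass.

    Returns {final_label: {"area": int, "bbox": (min_x, min_y, max_x, max_y)}}.
    """
    width = len(frame[0])
    parent = [0]    # equivalence table; index 0 is reserved for background
    feats = [None]  # per-label features: [area, min_x, min_y, max_x, max_y]

    def find(label):
        # Follow equivalences to the representative label (path halving).
        while parent[label] != label:
            parent[label] = parent[parent[label]]
            label = parent[label]
        return label

    prev_row = [0] * width  # row buffer: labels of the previous row
    for y, row in enumerate(frame):
        cur_row = [0] * width
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            # Already-visited part of the 8-neighborhood: W, NW, N, NE.
            neighbors = [prev_row[x]]
            if x > 0:
                neighbors += [cur_row[x - 1], prev_row[x - 1]]
            if x + 1 < width:
                neighbors.append(prev_row[x + 1])
            roots = {find(n) for n in neighbors if n}
            if roots:
                label = min(roots)
                for root in roots:
                    parent[root] = label  # record the equivalence
            else:
                label = len(parent)       # open a new provisional label
                parent.append(label)
                feats.append([0, x, y, x, y])
            cur_row[x] = label
            # Update the features of the assigned label on the fly.
            f = feats[label]
            f[0] += 1
            f[1], f[2] = min(f[1], x), min(f[2], y)
            f[3], f[4] = max(f[3], x), max(f[4], y)
        prev_row = cur_row

    # Merge step: fold the features of equivalent labels together.
    regions = {}
    for label in range(1, len(parent)):
        area, x0, y0, x1, y1 = feats[label]
        root = find(label)
        if root in regions:
            r = regions[root]
            bx0, by0, bx1, by1 = r["bbox"]
            r["area"] += area
            r["bbox"] = (min(bx0, x0), min(by0, y0), max(bx1, x1), max(by1, y1))
        else:
            regions[root] = {"area": area, "bbox": (x0, y0, x1, y1)}
    return regions
```

In this model only one row of labels is kept between rows, which is why the extra memory scales with frame width and with the number of provisional labels.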







# Architectural Customizations (I)



- Iterative architecture exploration
- LISA architecture description language
- Base core: Synopsys Processor Designer RISC (PD\_RISC)
- Added custom logic
  - Row buffer scratchpad + addressing
  - Label assignment logic
  - Equivalence table (ET) maintenance logic + register file
  - Features scratchpad + update logic
- HW size depends on
  - Frame size
  - Number of possible labels





# Algorithmic Modification: Slicing Approach

Hardware size explodes with frame size/complexity





# Architectural Customizations (II)



# Experimental Results: Setup and Metric Definition

- Input data sets
  - CCL performance influenced by frame complexity
  - Publicly available frame sets
  - Synthetic and natural images
- Performance metric
  - Cycles per pixel (cpp) characterizes architectural efficiency
  - Independent of core frequency
- Simulation setup
  - Cycle-accurate simulator used
  - Best/average/worst cpp obtained
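
As a minimal sketch of how the metric could be computed from simulator output: the helper names and the cycle counts below are hypothetical, and taking the average as the mean of per-frame cpp values is an assumption, since the slides do not say how frames are weighted.

```python
def cycles_per_pixel(cycles, width, height):
    """cpp for one frame: simulated cycles divided by the pixel count."""
    return cycles / (width * height)


def summarize_cpp(runs):
    """Return (best, average, worst) cpp over (cycles, width, height) tuples.

    Assumption: the average is the unweighted mean of per-frame cpp values.
    """
    values = [cycles_per_pixel(c, w, h) for c, w, h in runs]
    return min(values), sum(values) / len(values), max(values)


# Hypothetical cycle counts for three 10x10 frames.
runs = [(400, 10, 10), (600, 10, 10), (1100, 10, 10)]
best, average, worst = summarize_cpp(runs)
print(best, average, worst)  # 4.0 7.0 11.0
```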

#### Natural images







#### Synthetic data sets





# Experimental Results: Synthesis and Performance

- Synthesized with Synopsys Design Compiler
  - 350 MHz @ 65 nm, 1.8 mm²
  - Estimated power consumption
    - 110 mW @ 25 °C
    - 228 mW @ 125 °C
- Both *cpp* and *fps* metrics used
  - >30 fps on average for practical images
  - >10 fps worst case for most synthetic data sets
  - 5 fps in the absolute worst case
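
The two metrics are related by a first-order conversion. The formula below is a sketch that ignores any per-frame overhead, and it assumes a 1920×1080 (FullHD) frame, the size named in the conclusions; $f_{\text{clk}}$, $W$ and $H$ are symbols introduced here for the clock frequency and frame dimensions.

$$
\text{fps} = \frac{f_{\text{clk}}}{\text{cpp} \cdot W \cdot H}
$$

For example, with $f_{\text{clk}} = 350\,\text{MHz}$ and the highest cpp in the comparison table (28.9):

$$
\text{fps} = \frac{350 \times 10^{6}}{28.9 \cdot 1920 \cdot 1080} \approx 5.8
$$

which is in line with the quoted absolute worst case of about 5 fps.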





# Experimental Results: Flexibility and Comparison

- Impact of frame size and complexity
  - cpp varies with frame complexity
  - Super-slicing via SW (penalty observed)
- Performance comparison against
  - Original PD\_RISC (base)
  - TI TMS320C64x DSP
- Performance gains on Flickr (min/avg/max)
  - PD\_RISC: 6.7x/33.1x/87.8x
  - TI DSP: 10.2x/11.6x/12.9x



| Data set              | Case | Ours (cpp) | PD\_RISC base (cpp)     | Gain (x) | TMS320C64x (cpp)  | Gain (x) |
|-----------------------|-----|-------|--------------------------|--------|---------------------|-------|
| Homothety             | min | 3.14  | 22.0                     | 7.01   | 39.76               | 12.66 |
|                       | avg | 4.11  | 26.28                    | 6.39   | 42.70               | 10.38 |
|                       | max | 7.54  | 40.88                    | 5.42   | 50.60               | 6.71  |
| Random Percolation    | min | 3.31  | 17.16                    | 5.18   | 25.59               | 7.73  |
|                       | avg | 16.39 | 3,524                    | 215    | 394.42              | 24.06 |
|                       | max | 28.9  | 10,384                   | 359    | 1,107.53            | 38.32 |
| Finger                | min | 3.61  | 21.8                     | 6.04   | 35.81               | 9.92  |
|                       | avg | 5.03  | 84.07                    | 16.71  | 53.79               | 10.69 |
|                       | max | 6.49  | 167.48                   | 25.8   | 64.87               | 10.0  |
| Textures              | min | 3.12  | 19.75                    | 6.33   | 30.86               | 9.89  |
|                       | avg | 7.5   | 493.69                   | 65.82  | 106.12              | 14.15 |
|                       | max | 15.59 | 2,219.36                 | 142.35 | 303.20              | 19.45 |
| Flickr                | min | 3.3   | 22.38                    | 6.78   | 33.71               | 10.21 |
|                       | avg | 6.49  | 215.3                    | 33.17  | 75.25               | 11.60 |
|                       | max | 12.29 | 1,080.14                 | 87.88  | 158.56              | 12.9  |
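
The gain columns are consistent with a simple ratio of cpp values. Written out, with $\text{cpp}_{\text{ref}}$ introduced here as a symbol for the reference processor's cpp:

$$
\text{Gain} = \frac{\text{cpp}_{\text{ref}}}{\text{cpp}_{\text{ours}}}
$$

For the Homothety minimum case this gives $22.0 / 3.14 \approx 7.01$ against the PD\_RISC base and $39.76 / 3.14 \approx 12.66$ against the TMS320C64x.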










# Methodological Observations: Design Gap



# Methodological Observations: Bridging the Gap

- Pre-architectural estimation of achievable performance
  - Use high-level models to predict application cycles
  - Reduce the number of complete design iterations
  - Complement existing design flows
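
To make the idea of predicting cycles from a high-level model concrete, here is a deliberately simple sketch. It is not the estimator described in this talk: the operation classes, the per-class cycle costs in `MODEL`, and the function `estimate_cycles` are hypothetical, and the model ignores scheduling, pipeline hazards, and memory effects.

```python
# Hypothetical abstract processor model: cycles charged per operation class.
MODEL = {"alu": 1, "mul": 2, "load": 2, "store": 1, "branch": 3}


def estimate_cycles(op_counts, model):
    """Estimate cycles as the sum of profiled operation counts times cost."""
    unknown = set(op_counts) - set(model)
    if unknown:
        raise KeyError(f"no cost for operation classes: {sorted(unknown)}")
    return sum(count * model[op] for op, count in op_counts.items())


# Hypothetical operation counts obtained by profiling an application.
profile = {"alu": 100, "load": 40, "branch": 10}
print(estimate_cycles(profile, MODEL))  # 210
```

Because such an estimate needs only a profile and a cost table, it can be evaluated before any architecture is implemented, which is what allows it to reduce the number of complete design iterations.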












# Performance Estimation: Datapath (I)





# Performance Estimation: Datapath (II)

- Base estimation only covers:
  - Architecture selection
  - Instruction set design
- Does a HW modification improve performance?
  - Discard sub-optimal modifications
  - Analyze side effects
- Customization techniques to support:
  - Custom instructions/legacy IP
  - What-if scenarios based on code intervals
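
A what-if scenario on a code interval can be sketched as follows. This is an illustration under stated assumptions rather than the tool's actual mechanism: the interval names, their cycle estimates, and the function `what_if` are hypothetical.

```python
def what_if(interval_cycles, interval, new_cycles_per_exec, executions):
    """Return (new_total, speedup) if `interval` were replaced by custom HW.

    interval_cycles: estimated cycles per code interval (hypothetical data).
    new_cycles_per_exec: assumed latency of the replacement per execution.
    executions: how often the interval is executed.
    """
    baseline = sum(interval_cycles.values())
    replacement = new_cycles_per_exec * executions
    modified = baseline - interval_cycles[interval] + replacement
    return modified, baseline / modified


# Hypothetical per-interval estimates for a labeling kernel.
estimates = {"neighbor_scan": 6000, "label_assign": 3000, "feature_update": 1000}

# Scenario: a 1-cycle custom instruction replaces neighbor_scan (1000 runs).
total, speedup = what_if(estimates, "neighbor_scan", 1, 1000)
print(total, speedup)  # 5000 2.0
```

A modification whose speedup comes out at or below 1.0 would be a candidate to discard before spending effort on its hardware implementation.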







# Performance Estimation: Accuracy

- Usability depends on estimation accuracy
- Evaluated on several commercial processors
  - PD\_RISC (Synopsys)
  - C67x/C64x/C66x (TI DSPs)
- Reference: cycle-accurate simulators
  - Flat memory model, no caching
- Integrated by *Silexica* as a general-purpose estimator
  - ARM A7/A9/A15/M4 and Adapteva Epiphany models
  - Parallel application mapping onto heterogeneous MPSoCs



#### Average speedup: 248x (PD\_RISC), 67x (TI DSPs), comparing cycle-accurate simulation time vs. profiling + estimation time



# Performance Estimation: ASIP Design





# Performance Estimation: ASIP Design (I)












# Conclusions

- Created an ASIP capable of performing CCL:
  - Supports arbitrary frame sizes with varying complexity
  - Labels FullHD frames at 45/30/5 fps in the best/average/worst case
  - Evaluated on an extensive data set of more than 11,000 images
  - Outperforms a commercial TI DSP by a factor of about 10x
- Based on the ASIP design experience:
  - Realized a set of tools that enable high-level performance estimation based on abstract processor models
  - Estimation error of up to $\pm 15\%$ for the modeled processors
  - Estimation is up to 248x faster than cycle-accurate simulation
  - Currently being applied by Silexica Software Solutions GmbH
    - Estimation used for MPSoC task mapping decisions
    - New processor models being created (ARM, Epiphany)





# Questions?

