

## Xtensa + TIE Feature Update To RF.3 release – some key points of my choice ©

Marcus Binning, Senior AE Manager, Europe (marcusb@cadence.com) Tensilica User Day Hanover Feb 9<sup>th</sup> 2016

## cādence®

#### Agenda

- Overview
- Memory System
- TIE
- Tools / Debug







## Overview of features up to Release RF.3

- Not going to identify features by release, just where things stand today
  - The release notes will tell you when things were introduced
- We will cover some key (new) TIE language features
- First will cover some of the major "other" pieces of the Xtensa architecture
- NOTE This talk will concentrate on GENERIC features, and will not describe all the DSP ISAs currently available
  - Lots of information on <a href="http://ip.cadence.com/ipportfolio/tensilica-ip">http://ip.cadence.com/ipportfolio/tensilica-ip</a>

- Conceptually LX6 is a development of all the LX that went before
  - No MAJOR discontinuities .... Yet  $\ensuremath{\textcircled{\odot}}$





# Memory System





#### Xtensa Architecture: Local Memory Subsystem



cādence°

#### Xtensa Architecture: Dual Load/Store and Banking



#### cādence<sup>°</sup>

## Parallelize Memory Accesses with Dual Load/Store

• Requires FLIX instruction to fully utilize dual Load/Store

Requires Dual Ported Memories Configuration with C-Box → no dual ported memory required





- C-Box is an "arbiter-mux"
  - Multiple cycles are needed to satisfy multiple requests, if they are going to the same DRAMx address space
  - Loss some of FLIX efficiency when multiple Load/Store target same Data RAM

#### cādence<sup>®</sup>

## Data Banking Example with Two Banks





## Two Load/Store Units with Data Cache

#### Configuration requirements

- Must be a write-back D-cache
- Must have at least 2 banks
- Must have early restart (will be explained later)

#### • Each L/S Unit has its own copy of the DTag RAM.

- Two simultaneous requests to different addresses require two DTag lookups for parallel access
- DTag RAMs are written with the same data.



## DCache with Banking

Cache-lines of one way are spread over multiple banks  $\rightarrow$  Parallel access of values in the same cache line

|       |      |      |                     | _ |      |      |                     |
|-------|------|------|---------------------|---|------|------|---------------------|
|       |      | Bank |                     |   |      | Bank |                     |
| Way 0 | 0000 | 0    | Cache line 0, way 0 |   | 0000 | 1    | Cache line 0, way 0 |
|       | 0001 | 0    | Cache line 1, way 0 |   | 0001 | 1    | Cache line 1, way 0 |
|       | 0002 | 0    | Cache line 2, way 0 |   | 0002 | 1    | Cache line 2, way 0 |
|       | 0003 | 0    | Cache line 3, way 0 |   | 0003 | 1    | Cache line 3, way 0 |
|       |      | 0    | Cache line 4, way 0 |   |      | 1    | Cache line 4, way 0 |
|       |      |      |                     |   |      |      |                     |
|       |      | Bank |                     |   |      | Bank |                     |
|       | 0000 | 0    | Cache line 0, way 1 |   | 0000 | 1    | Cache line 0, way 1 |
| Way 1 | 0001 | 0    | Cache line 1, way 1 |   | 0001 | 1    | Cache line 1, way 1 |
|       | 0002 | 0    | Cache line 2, way 1 |   | 0002 | 1    | Cache line 2, way 1 |
|       | 0003 | 0    | Cache line 3, way 1 |   | 0003 | 1    | Cache line 3, way 1 |
|       | 0005 | 0    |                     |   |      | •    |                     |

#### Benefits:

- Cache lines are loaded to smaller memories
- Software accesses only the bank it needs

## Memory Errors and ECC

- Memory protection available for Caches (tag and data) and local RAM (DataRAM and InstRAM)
- Can configure the memory error method for Instruction Side and Data side separately
- Error Correction
  - Correct single bit memory errors "on-the-fly".
  - Correctable error will trigger a replay of the memory access with the corrected value
  - Xtensa does not write the error-corrected data back to the memory

Uncorrectable Error  $\rightarrow$  ECC double bit error or Parity errors will trigger exceptions

| Memory Type                             | Parity                | ECC                    |
|-----------------------------------------|-----------------------|------------------------|
| Data Memories                           | 1 bit per Byte        | 5 bits per Byte        |
| Data Caches                             | 1 bit per Byte        | 5 bits per Byte        |
| Instruction Memories                    | 1 bit per 32 bit word | 7 bits per 32 bit word |
| Instruction Caches                      | 1 bit per 32 bit word | 7 bits per 32 bit word |
| Instruction and Data<br>Cache tag array | 1 bit per tag         | 7 bits per tag         |



## **PIF** on Xtensa

 The Processor Interface (PIF) protocol is a high bandwidth, full duplex system interconnect protocol

#### PIF Port

- Reads and write accesses are never speculative
- Recommended interface for memory that require additional wait states

#### PIF is a configuration option

- If Cache is configured, PIF is required
- If Inbound PIF is configured, PIF is required
- PIF is fully mapped to Xtensa's 4Gbyte address range, excluding address ranges of InstRam, InstRom, DataRam, DataRom and XLMI,
- PIF additional optional features to decrease cache miss latency
  - Critical word first
  - Early Restart
  - Prefetch



## **PIF Protocol - Basic**

#### Split transaction interface

- Separate Request & Response Channels
- Master arbitrates for request channel
- Slave arbitrates for response channel

#### Simple flow-control for request and response

- Initiator: Valid
- Receiver: Rdy
- When both Valid and Rdy are asserted at rising edge of CLK, a transaction completes

#### Supports multiple outstanding requests

- Responses can be out of order
- 6-bits transaction ID used by requestor

#### • Write response for writes can be configured to identify bus data or address errors

- 16 unique IDs to allow up to 16 write outstanding write responses
- To ensure read/write memory ordering and synchronization use memw instruction
- $\rightarrow$  Xtensa stalls when memw is executed until all outstanding write responses have returned



#### Cache Prefetch: Overview

- Problem: long cache miss latency
  - Xtensa pipeline will replay and then stall during cache miss
  - It may take many cycles to load a cache line from system memory
  - Load miss or Instruction Fetch miss waits for PIF block read response
  - Need to load many values stored in contiguous addresses

#### Solution: cache Prefetch option

- Prefetch additional cache line(s) from contiguous addresses ahead of a cache miss
- Prefetch hardware Identifies a stream
- Requires Data cache; Instruction cache is optional
- Additional Cache lines are Stored in Prefetch Buffer
- Benefit: Miss penalty for long (50-100 cycles) system memory latency is reduced



#### Hardware Prefetch Modes

- Prefetch Buffer (SRAM) is outside of the Xtensa core
- Buffer size configurable to 8 or 16 entries, of Cache Line size
- Hardware Prefetch mode is controlled with the PREFCTL register

| PREFCTL Register | 8 | / 4     | 3 0     | ļ |
|------------------|---|---------|---------|---|
| reserved         |   | InstCtl | DataCtl |   |

| Prefetch<br>Control<br>Nibble Value | "Stream Detected"<br>when a cache miss<br>occurs and              | Additional cache<br>lines prefetched,<br>when <i>no</i> stream<br>is detected | Additional cache<br>lines prefetched,<br>when stream is<br>detected | *Maximum<br>Number of<br>buffer<br>entries per<br>stream |
|-------------------------------------|-------------------------------------------------------------------|-------------------------------------------------------------------------------|---------------------------------------------------------------------|----------------------------------------------------------|
| 0x0                                 | -                                                                 | none (disabled)                                                               | none (disabled)                                                     | -                                                        |
| 0x4                                 | Hit in Prefetch Buffer<br>OR detected using<br>Miss History Table | none                                                                          | 2*                                                                  | 2                                                        |
| 0 <b>x</b> 5                        | Hit in Prefetch Buffer                                            | 1*                                                                            | 2*                                                                  | 2                                                        |
| <b>0x8</b>                          | Hit in Prefetch Buffer                                            | 2*                                                                            | 4*                                                                  | 4                                                        |

16 © 2016 Cadence Design Systems, Inc. All rights reserved.

## Example of "Mode 0x4" Prefetch Algorithm (Animated)



#### cādence<sup>®</sup>

## Why Prefetch direct to L1 ?



Assume for example 64-bit wide cache, and a 64-byte cache line

ightarrow Takes 8 cycles to move 1 cache line from Prefetch to L1

 Benefit: Prefetch to L1 increases performance for lower latency (1-20 cycles) system memory
 accesses, but might thrash the cache

#### Prefetch types in LX6 core

- Three prefetch types, one objective: avoid or reduce cache-miss latency
- All types use the same prefetch hardware
  - HW Prefetch
    - Automatically initiate prefetch upon stream detection or cache miss.
    - Predictive hardware prefetching of instruction and data
    - Alleviates penalty associated with large external memory latencies
  - SW Prefetch
    - User's software program initiates Prefetch
    - Prefetches single data cache-line into L1-Data Cache or Prefetch buffer.
  - Block Prefetch: New Feature
    - User's software initiates Prefetch
    - Allows prefetch of a block of data spanning multiple cache-lines
    - Block prefetches can be grouped and queued in user software
- <sup>19</sup> © 2016 Cadence Design Systems Inc. User cand limit number of Block prefetch upgrade requests



## Prefetch HW





#### **Block Prefetch Operations**

- Operations supported by XTHAL APIs
  - Load a block of data into the L1 data cache (upgrade)
  - Write back (Castout & invalidate) a block of data from the L1 data cache (downgrade)
  - allocate a block in the L1 data cache whose content to be modified, without reading memory (Prefetch and modify upgrade)
  - cache block operations: cancel, wait, abort, start new group
- User needs to initiate the block prefetch of a data block in advance
  - Block prefetch can eliminate cache misses (both in the normal case and in the "prefetch and modify"). Make sure to enable DL1 option (prefetch to L1 DCache) in PREFCTL register.
  - In the case of certain cache line allocation conflicts with the Load/Store units, the prefetch logic can choose to keep the prefetch response data in the prefetch buffer. Such prefetched data will be brought to L1 cache when the actual data cache miss occurs.

#### cādence°

#### Memory copy example

Assuming config has:

- Total 16 prefetch entries
- Cache line size is 128 bytes
- Allocate 8 entries for block prefetch

#include <xtensa/core-macros.h>

Begins block prefetch and groups the subsequent block prefetch requests

```
test fun(u8 *pdest, u8 *psrc) {
                                                 // 8 lines of 128 bytes each
    u32 size = 1024;
    // reserve 8 entries for block prefetch, Inst/Data Ctl: Prefetch 2 lines on stream
      detect
    xthal set cache prefetch long (XTHAL PREFETCH_BLOCKS(8) | XTHAL DCACHE_PREFETCH_MEDIUM |
      XTHAL ICACHE PREFETCH MEDIUM | XTHAL DCACHE PREFETCH L1);
                       /*- - - - other code here - - - */
    xthal dcache block prefetch for read grp (psrc, size);
                                                                   // Group prefetch
      begins
    xthal dcache block prefetch modify (pdest, size)
                                                             morefetch & modify; no PIF
      read.
                                                                             Prefetch and modify !
                       /*- - - - other code here - - - - */
                                                                          No PIF request, no refill to L1 cache.
    memcpy(pdest, psrc, size); // 1024 bytes copy
                                                                          Just request an L1 Cache allocation
                       /*- - - - - other code here - - - - */
```



#### Prefetch types – At a quick glance

|                             | HW Prefetch                                                             | SW prefetch                                       | Block Prefetch                                                        |  |  |
|-----------------------------|-------------------------------------------------------------------------|---------------------------------------------------|-----------------------------------------------------------------------|--|--|
| Xtensa Core<br>Architecture | LX 4 onwards                                                            | LX 4 onwards                                      | LX 6 onwards                                                          |  |  |
| Prefetch initiated by       | HW on stream detect                                                     | User's software program                           | User's software program                                               |  |  |
| Prefetch<br>quantity        | 1/2/4 cache lines – depending on aggressiveness. Software programmable. | Only one cacheline                                | One or Multiple cachelines                                            |  |  |
| Cache<br>supported          | Both Dcache and Icache *                                                | Only for DCache                                   | Only for Dcache                                                       |  |  |
| HW<br>configurability       | Prefetch buffer & number of entries – Core configuration option         | Prefetch buffer –<br>Core configuration<br>option | Prefetch buffer & number of<br>entries – Core configuration<br>option |  |  |
| Software<br>programmable    | Prefetch aggressiveness for I\$ and D\$ independently                   | Yes.                                              | Can reserve #entries in<br>prefetch buffer for block<br>prefetch      |  |  |
| Configuration register      | PREFCTL                                                                 | PREFCTL                                           | PREFCTL                                                               |  |  |

\* Prefetch always applies to the data cache, and applies to the <u>instruction cache</u> for some configurations of line sizes. (Changed in LX6)







### Important new TIE features

- Register file architectures
- FLIX design
- Hardware sharing (shared semantics)
- Protos some notes
- Tc reports

## **Register Files**

- In a load-store machine the register file architecture is extremely important
  - Most operands to computation will come from them in one form or another
- Now there is great flexibility in the architecture, size and depth of regfiles
  - Can be very wide (> 1024bit)
  - Can have arbitrary numbers of registers > 1 ... you want an 11-entry 73bit regfile ? You got it!
  - Can have up to 32 (count 'em) read ports and 16 ("are you sure?!") write ports
    - Remember that currently all read/write ports are fully bypassed ... you would not actually want to reach these
      limits because the bypass networks would likely be enormous ③
  - Can be linked into "virtual" register files in pairs or quads
    - Make a 16-entry 32-bit regfile usable as a 8-entry 64-bit or 4-entry 128-bit
    - Opens up new possibilities for DSP processing
- From the programmer's viewpoint, register files are where "custom ctypes" live
  - The compiler will manage scheduling, load/spill, allocation ....
- You can have a total of 24 separate register files in an Xtensa

#### Since (RE.2) Register Groups: 1 implementation but 3 virtual views



- A user regfile can be used as three virtual regfiles of different widths
  - x1:  $2^n$  entries, each *w* wide
  - x2:  $2^{n-1}$  entries, each 2w wide
    - Implicitly refers to 2 consecutive entries of physical implementation
  - x4:  $2^{n-2}$  entries, each 4w wide
    - Implicitly refers to 4 consecutive entries of physical implementation
- All three virtual regfiles use same physical hardware
  - XCC ensures no overlaps when allocating register indices for variables
- x1 virtual regfile is always available
- x2 and x4 virtual regfiles need to be enabled explicitly in TIE
  - Either or both can be enabled

27

#### Register Groups Syntax





#### Physical Implementation Choice MULTI-PORT Implementation

Figure shows a single read port on a register file with x2 and x4 register groups

#### regfile CR1 8 32 cr CR2=2 CR4=4



- The regfile is implemented as a single bank *w*-wide and *n*-deep
- Each regfile port is w-wide
- Regfile ports used per operand is variable
  - x1 regfile access uses 1 port
  - x2 regfile access uses 2 ports
  - x4 regfile access uses 4 ports
- Can potentially increase ports
   on regfile
  - Impacts area (regfile core and read/write pipelines)
  - Could adversely impact timing



#### Physical Implementation Choice VIEWS Implementation

- Regfile is composed of banks of equal depth
  - E.g. the regfile in figure has 4 banks, each 8-wide and 8-deep
- Regfile indices are interleaved across banks
  - E.g., Index 0 is in Bank 0, Index 1 is in Bank 1, and so on...
- Wider data is a concatenation of multiple banks
  - E.g., x2 data is a concatenation of data on Banks 0 and 1, or 2 and 3

Figure shows a single read port on a register file with x2 and x4 register groups



being accessed

regfile CR1 8 32 cr CR2=2 CR4=4

#### cādence°

#### FLIX Support Flexible Length Instruction eXtension

format f64 64 {slot0, slot1}
slot\_opcodes slot0 {ADD, AND}
slot\_opcodes slot1 {SUB, XOR}

- First iteration (circa 2003) Single 64-bit FLIX length allowed
- Since then …
  - Gradual relaxation of limits, shorter FLIX (32b) → longer FLIX (128b), more FLIX (2 lengths allowed, then more ...)
- Implementation Limits (rarely reached):
  - Max Formats $\rightarrow$ 24- Max slots $\rightarrow$ 64- Max slots (in 128b format)  $\rightarrow$ 30- Format lengths $\rightarrow$ 32 <= Format Length <= 128 (integer #bytes)</td>
- It's pretty Flexible 🙂
- Watch out you can create a monster of an instruction decode unit
  - Remember KISS !



## Sharing Hardware

#### Shared functions

- Very easy to use, can share hardware "per slot" or globally
- Useful for quick "what-if" analysis, but not the best for "final implementations"
  - there isn't time here to dig into "why", just take my word on it  $\ensuremath{\textcircled{\sc on}}$

#### Semantics are the way to go

- What we use 🙂
- Can be multi-cycle, can implement the HW for 100's of instructions
  - Which means they can be very complicated and difficult to write/debug  $\rightarrow$  need a structured design flow

#### • NOW Semantics can be SHARED across slots in a FLIX machine

- Still only one copy of the hardware
- More flexibility in HW design for the "mix" of instructions the compiler uses
- More efficient than using shared functions.
- See section 20.6 in RF.3 TIE Reference Manual



## Protos – what are they and why do they matter ?

- Essentially the mechanism for connecting xt-xcc to your hardware
- Used to define the programming model
  - E.g. operator overloading, type conversions, default types, instruction fusing (technically that's a "imap")

#### Can be single instructions, or sequences

```
proto BBE_SBMULSTATE { in xb_vecNx16 * a }{xb_vecNx16 t0, xb_vecNx16 t1, xb_vecNx16 t2, xb_vecNx16 t3}{
    BBE_MOVVBMULSTATE t0, 0;
    BBE_SVNX16_I t0, a, 0;
    BBE_MOVVBMULSTATE t1, 1;
    BBE_SVNX16_I t1, a, 32;
    BBE_MOVVBMULSTATE t2, 2;
    BBE_SVNX16_I t2, a, 64;
    BBE_MOVVBMULSTATE t3, 3;
    BBE_SVNX16_I t3, a, 96;
}
```

Greatly enhance the programmer's experience – read up on them

## TIE compiler reports

- When you compile some TIE code, tc creates some reports
  - And lots of other things ... they are all in the tdk/ directory
- Area estimates
- Power estimates
- Report on operand, regfile and FLIX design
- All useful, note the choice of word ... "estimate"
  - Do not start writing your marketing material based on these numbers  $\odot$



## Tdk .report – some highlights

- Very useful regfile information #ports and operand schedules
  - There is now an advanced TIE training class that goes into why this is important

```
### TIE Report File
# These are the instructions in each slot
slot <Inst> instructions: nop=NOP
    MB MOVMBVMBV
slot <fusion slot0> instructions: nop=NOP
    MB L64 I MB S64 I
slot <fusion slot1> instructions: nop=NOP
    POP COUNT128 64 POP COUNT64 16 POP COUNT64 32 POP COUNT64 8
### Register File port information
# These are the read and write ports for each regfile.
regfile AR:
        port rd0: width( 32) stage( 1 2) operands( {ars,0} {opnd ae sem ar to dr a0,0})
       port rd1: width( 32) stage( 1 2 3) operands( {art,0} {opnd MB L64 I address,0} {opnd MB S64 I address,0} {opnd ae sem ar to dr a1,0} {opnd
        port rd2: width( 32) stage( 1 2) operands( {art,1})
        port rd3: width( 32) stage( 1 2) operands( {ars,1})
        port wr0: width( 32) stage( 1 2) operands( {ar0,0} {ar12,0} {ar4,0} {ar8,0} {arr,0} {ars,0} {ars_entry,0} {art,0} {opnd_ae_sem_dr_to_ar_a,
       port wr1: width( 32) stage( 1 3) operands( {stage3 ar operand,0} {arr,1} {art,1} {opnd AE CALCRNG3 a,1} {stage3 ar operand,1})
regfile BR:
        port rd0: width( 1) stage( 1) operands( {bs,0})
       port rd1: width( 1 4 8 16) stage( 1 2) operands( {brall,0} {bs4,0} {bs8,0} {bt,0})
        port rd2: width( 1 2 4) stage( 2) operands( {bt2,0} {bt,1} {bt2,1} {bt4,1})
       port wr0: width( 1 2 4 16) stage( 1 2) operands( {br,0} {br2,0} {brall,0} {bt,0} {br,1} {br2,1} {br4,1})
regfile AE DR:
       port rd0: width( 64) stage( 2) operands( {frs,0} {opnd ae sem arithmetic v,0} {opnd ae sem arithmetic v0,0} {opnd ae sem cmov v0,0} {opnd
        port rd1: width( 64) stage( 2 3) operands( {frt,0} {opnd ae sem arithmetic v1,0} {opnd ae sem cmov v,0} {opnd ae sem cmpp v0,0} {opnd ae sem cmpp v0,0}
        port rd2: width( 64) stage( 2) operands( {frs,1} {opnd ae sem mul x2 S1 d0,1})
       port rd3: width( 64) stage( 2) operands( {frr,0} {frr,1} {opnd AE ZEROB v0,1} {opnd ae sem arithmetic v1,1} {opnd ae sem arithmetic va,1}
       port wr0: width( 64) stage( 2 3 5) operands( {frr,0} {opnd_ae_sem_ar_to_dr_v,0} {opnd_ae_sem_arithmetic_v,0} {opnd_ae_sem_cmov v,0} {opnd_ae_sem_cmov v,0}
        port wr1: width( 64) stage( 2 3 5) operands( {frr,1} {opnd POP COUNT128 64 res,1} {opnd ae sem ar to dr v,1} {opnd ae sem arithmetic v,1}
 © 2016 Cadence Design Systems, Inc. All rights reserved.
```

#### cādence<sup>®</sup>

ò

## Tdk .report – flops inferred in semantics

- Will be due to multi-cycle semantics
- Typically "wide" pipeline flops should be only one pipe stage deep
- Useful tool for spotting "inefficient" schedules ...

```
Semantic: MB_L64_I
        32 * 1 opnd_MB_L64_I_offset: 0 -> 1
Semantic: MB_S64_I
        32 * 1 opnd_MB_S64_I_offset: 0 -> 1
Semantic: POP_COUNT128_64
        1 * 2 POP_COUNT128_64: 0 -> 2
Semantic: popc_sem
        1 * 2 POP_COUNT64_8: 0 -> 2
        1 * 2 POP_COUNT64_16: 0 -> 2
        1 * 2 POP_COUNT64_16: 0 -> 2
        1 * 2 POP_COUNT64_32: 0 -> 2
        1 * 2 TIE_inst_cp1: 0 -> 2
```

### Report the generated flops in each semantic

• There are many other useful pieces of info in the .report files ...



## **TIE constructs – Honourable Mentions**

Interesting features to read up on

- TIE property "shared\_or" and "ignore state output"
  - Allows creation of "sticky bits" in your FLIX design (TIE RM section 4 and 20.2)
- Bitkill allows partial update of output operands
  - no need to create inouts where not needed (TIE RM 7.4, 7.5)
- Specialized\_op tell the compiler two instructions only differ e.g. by immediate
  - Useful when you have a "large immediate" and "small immediate" version of an instruction
  - Compiler will select the most efficient based on immediate value and code optimisation (TIE RM 20.1)
- Header\_include embed useful header info in the TIE file
  - TIE RM 20.9 Provide a header "automatically" along with your TIE headers
- TIEprint() diagnostics for ISS simulation
  - This is purely a simulation artefact useful for debugging (TIE RM 26)





## Tools / Debug





## **Performance Monitors**

Non-intrusive counting of events in a real HW target

- Performance Monitors hardware counting of events on the updated trace port
- When to use:
  - When ISS/XTSC system model is not enough to measure performance
    - Not enough of the system is modeled
    - Model is not accurate enough
    - Anything you have to have Silicon or FPGA
  - In-Field performance analysis, debugging

#### Use Cases:

- Statistical Profiling provides an overview
  - Overall performance profiling of target software with large data sets
  - Identify inefficient functions or sections in the code
- Details function analysis with direct counter access
  - Detailed analysis of individual functions or sections in the code
  - Identify inefficient constructs
  - Identify reasons for inefficiency due to stalls, cache misses,...

## What is counted ?

Up to 8 32-bit counters software programmable

- There are almost 100 countable hardware events, in **SELECT** groups listed below
- Select Groups:
  - Always Increment (counts cycles)
  - Overflow of Counter N-1
  - Successfully Retired Instructions
  - D-Side Global Stalls
  - I-Side Global Stalls
  - Exceptions and Replays
  - Holds and other Bubbles
  - I-TLB Access
  - I-Side Memory Accesses
  - D-TLB Access
  - Data Memory Load Instruction
     LS0,LS1

- Data Memory Stores Instruction
   LS0,LS1
- Data Memory Accesses

   LS0,LS1
- Multiple Loads/Stores
- Outbound Prefetch/Castout
- Inbound I or D Target
- Prefetch



#### Use the provided HAL APIs

- Always best to use the (supported) HAL APIs to access these types of functions
- See debug guide chapter 8
- See Software Toolkit userguide chapter 6

- The above give descriptions how to do profiling using these counters
- Remember  $\rightarrow$  They are on the trace port, they are NON-INTRUSIVE



#### Code Coverage Built into Xplorer

| <pre>popcount_TIE.c pop_count2.tie  utils.c % *10 int i, retval =0; for(i=0; i<count; !="dataPtr[i])" #if="" (verbose="" i++)="" if(goldptr[i]="" {=""> 1)         printf("CV8:: Iter %d Gold&gt; %d Data&gt; %d\n", i, gold# endif     retval++;     } }</count;></pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | <pre># VERBOSE   stdint.h   stdio.h   compare_vec8(uint8_t*, uint8_t*, int) : int   compare_vec16(uint16_t*, uint16_t*, int) : int       compare_vec32(uint32_t*, uint32_t*, int) : int       init_vec8(uint8_t*, int) : void       ioit_vec16(uint16_t*, int) : void</pre>                                                                                                                                                                                                                                                                     | Search expression<br>Code Coverage vie<br>Scope Default<br>Local Help (3 hit<br>Code Coverag<br>Code Coverag | n:<br>ew<br>(s)<br>e View<br>ge View The Coo   | lated Topics 💷 B                                    | isplays code coverage<br>lines of code executed |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|------------------------------------------------|-----------------------------------------------------|-------------------------------------------------|
| <pre>return retval ; }</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | (60) on the Community M                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                                                                                              |                                                |                                                     | €<br>File Fun ▽                                 |
| <pre>int compare_vec16(uint16_t *goldPtr, uint16_t *dataPtr, int count)      {         int i, retval =0;         for(i=0; i<count; !="dataPtr[i])" i++)="" if(goldptr[i]="" td="" {="" {<=""><td>Code Coverage ≥         File         <total>         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/</total></td><td>% Covered<br/>79.2<br/>53.8<br/>59.3<br/>100.0<br/>100.0<br/>100.0</td><td>Total LOC<br/>192<br/>39<br/>54<br/>43<br/>24<br/>32</td><td>Lines Executed<br/>152<br/>21<br/>32<br/>43<br/>24<br/>32</td><td>Lines Missed<br/>40<br/>18<br/>22<br/>0<br/>0<br/>0</td></count;></pre> | Code Coverage ≥         File <total>         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/         D:/customers/Fingerprint_Systems/workshop_jan2016/fuse_wsf3/</total> | % Covered<br>79.2<br>53.8<br>59.3<br>100.0<br>100.0<br>100.0                                                 | Total LOC<br>192<br>39<br>54<br>43<br>24<br>32 | Lines Executed<br>152<br>21<br>32<br>43<br>24<br>32 | Lines Missed<br>40<br>18<br>22<br>0<br>0<br>0   |
| <pre>     return retval;     int compare_vec32(uint32_t *goldPtr, uint32_t *dataPtr, int count)     {         int i, retval =0;         for(i=0; i<count; !="dataPtr[i])&lt;/td" i++)="" if(goldptr[i]="" {=""><td></td><td></td><td></td><td></td><td></td></count;></pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                                                                                                              |                                                |                                                     |                                                 |



### Quick points on code coverage

- Provides statistics on lines of source code executed over one or more runs
- Good for verifying complete exercising of all lines of source code (e.g. testing) or any areas of dead code that need to be removed
- Only supported in Xplorer, no command-line version is available
- Uses feedback data collection (code instrumenting) to check coverage
- Coverage can be checked using either software or hardware feedback mechanisms
- Better results at lower optimization levels
- Coverage under IPA not supported
- Runs are cumulative, so multiple runs may be done under different scenarios to verify complete coverage
- See Xplorer online help for details



- Hopefully some points of interest thanks for listening!
- Xtensa is a wide topic there's a lot to know
- There is an increasing amount of training going to the iLS system
- There are user groups
- There will be lunch ... shortly ③



# cādence®