
Generate Profiles Using PMU Metrics

Execution-time profiling captures code execution times, which can help you identify bottlenecks. See Aggregate Execution-Time Profiles to Identify Worst Execution and View and Compare Code Execution Times. If you cannot explain the observed execution times by examining the generated code and your hardware is Linux®-based, you can investigate further by analyzing metrics from the Performance Monitoring Unit (PMU) of the hardware.

For example, if the generated code multiplies two matrices, you might notice that the execution time increases significantly with matrix size. In this case, if you profile the generated code by using the number of data cache misses as the metric, you can observe how the cache size of the selected hardware affects the operation of the chosen algorithm. You can use this information to tailor the algorithm and improve its performance on that hardware.

In a software-in-the-loop (SIL) or processor-in-the-loop (PIL) simulation, use the model parameter CodeProfilingCodeMetrics to specify the metric that you want to extract from the PMU.

PMU Metric Options

This table shows the options you can use to extract PMU metrics.

Model Parameter             Option              Metric Extracted from PMU
CodeProfilingCodeMetrics    'time' (default)    N/A
                            'totinstr'          Total instruction count
                            'floatinsts'        FPU instruction count
                            'floatopts'         FPU operation count
                            'intinstr'          ALU instruction count
                            'loadinstr'         Load instruction count
                            'storeinstr'        Store instruction count
                            'totcyc'            Total cycle count
                            'l1dcm'             L1 data cache misses
                            'l1icm'             L1 instruction cache misses
                            'l2dcm'             L2 data cache misses
                            'l2icm'             L2 instruction cache misses
                            'tlbdm'             TLB data misses
                            'tlbim'             TLB instruction misses
                            'memstall'          Memory stall cycles
                            'totstall'          Total stall cycles
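
For example, assuming your model is already configured for a SIL or PIL simulation with code execution profiling enabled, you can select the total cycle count metric. The model name here is illustrative:

set_param('myModel', 'CodeProfilingCodeMetrics', 'totcyc');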

Check PMU Support on Target or Development Computer

The default driver provided for Linux-based target hardware uses the Performance Application Programming Interface (PAPI) for collecting performance counter information. This library provides an abstraction layer between the generated code and the kernel and PMU underneath. To check that you can use the PMU on your target or development computer:

  1. Install PAPI if it is not available. The software supports PAPI 6.0 or later for the Debian®, Ubuntu®, Red Hat® Enterprise Linux™, CentOS®, and Fedora® operating systems. If you prefer, you can download the PAPI code and compile it manually.

  2. To allow unprivileged users to use the perf performance events system, write -1 to the perf_event_paranoid file:

    sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"

  3. To verify that counters are available and to view the events that you can log, run the PAPI utility program:

    papi_avail
    Or, if you intend to run a SIL simulation on your Linux development computer, in the Command Window, enter:
    coder.profile.pmu.verifyPMUOnHost(metricOption)

    By default, metricOption is 'totcyc'. This command generates, compiles, and runs a simple test program to verify that the drivers work on your development computer.
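
    For example, to check that the L1 data cache miss counter can be read on your development computer before you run a SIL simulation, pass the corresponding option. This sketch assumes that verifyPMUOnHost accepts the same option strings as CodeProfilingCodeMetrics:

    coder.profile.pmu.verifyPMUOnHost('l1dcm')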

Generate and Analyze Profile

The workflow for producing a profile using a PMU metric is similar to the workflow that produces an execution-time profile.

Configure the model for execution-time profiling.

% Open the example model.
openExample('ecoder/SILPILVerificationExample', ...
             supportingFile='SILTopModel.slx')

% Enable code execution profiling, set the instrumentation level,
% and save all profiling data.
set_param(gcs, 'CodeExecutionProfiling', 'on');
set_param(gcs, 'CodeProfilingInstrumentation', 'coarse');
set_param(gcs, 'CodeProfilingSaveOptions', 'AllData');

% Turn off model coverage and code coverage measurement.
set_param(gcs, 'CovEnable', 'off');
covSettings = get_param(gcs, 'CodeCoverageSettings');
covSettings.CoverageTool = 'None';
set_param(gcs, 'CodeCoverageSettings', covSettings);

Specify the metric that you want to extract from the PMU. For example, to extract L1 data cache misses, enter:

set_param(gcs, 'CodeProfilingCodeMetrics', 'l1dcm');

Run the simulation.

simOut = sim(gcs);

When the simulation is complete, in the Simulink® Editor, the profiled model components are colored blue. For top-model SIL or PIL simulations, the Simulink Editor background is also colored blue. The simulation also generates the variable executionProfile (default) in the object simOut.
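
You can also inspect the profiling result programmatically. For example, assuming the object in simOut.executionProfile supports the same report method as an execution-time profile, you can generate an HTML report of the collected values from the Command Window:

report(simOut.executionProfile)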

To view the results in the Code Profile Analyzer, in the Profiling details panel, click the Code Profile Analyzer link. Or, from the Command Window, run:

coder.profile.show(simOut.executionProfile)

The app displays information in the Function Execution and Summary panels.

Function Execution panel displays Function Metrics view and Relative Function Metrics view.

In the Function Execution panel, the Function Metrics view provides a function-call tree and metrics for functions and their child functions.

The Function Metrics view provides this information:

  • Section — Name of profiled code section.

  • Function Name — Name of profiled function.

  • Minimum Metric — Minimum metric value for function.

  • Average Metric — Average metric value for function.

  • Maximum Metric — Maximum metric value for function.

  • Average Self Metric — Average metric value, excluding metric values for child functions.

  • Maximum Self Metric — Maximum metric value, excluding metric values for child functions.

  • Calls — Number of calls to the function.

  • Execution Time Standard Deviation — A measure of the spread of function metric values about the average value.

  • Execution Time Variance — A measure of the dispersion of function metric values. The metric standard deviation is the square root of this value.

The Relative Function Metrics view provides function metric values as percentages of caller function and total metric values. The table contains this information.

Column                       Comparison Performed
Self / Caller (%)            Function self metric value compared with the total metric value for the caller function.
Self / Task (%)              Function self metric value compared with the total metric value for the task.
Self / Simulation (%)        Function self metric value compared with the total metric value for the simulation.
Function / Simulation (%)    Function metric value, which includes the self metric value and the metric values for child functions, compared with the total metric value for the simulation.

The Results section of the toolstrip provides features that you can use in your analysis of functions:

  • To trace the model component associated with a generated function and its metric values:

    1. In the function-call tree or a table, click the function. For example, CounterTypeA, which is called by step [0.1 0].

    2. On the toolstrip, in the Results section, click Highlight Source. The Simulink Editor identifies the subsystem.

  • To view the function call in generated code, in the Results section, click Highlight Code. The Simulink Editor displays the call in the Code view.

  • To display the metric value distribution for the function, click Generate Distribution.

  • To view metric values for the generated code in an HTML report, click Generate Report.

Compare Profile Against Baseline

You can compare metric values from the current simulation against values from a baseline simulation.

  1. Select the Comparison panel, which displays the Workspace variable and four controls.

    Section of Comparison panel that displays workspace variable name and four controls.

  2. From the Select a profile variable drop-down list, select the workspace variable that contains results from the baseline simulation.

  3. Click Compare.

The panel displays:

  • The function-call tree.

  • A percentage comparison of metric values for the two simulations.

The table cells are colored:

  • Green if the current metric value is less than the baseline value.

  • Yellow if the current metric value is greater than the baseline value.

If the workspace variable for the baseline simulation does not contain data for a code section, the panel displays NaN.

If you click Compare in SDI, the Simulation Data Inspector loads and compares the selected results.

To invert the comparison order, that is, make results from the current simulation the reference, click Invert order.

Section of Comparison panel with controls and workspace variable name after inversion of comparison order.
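
For example, to make a baseline profile available in the drop-down list, you can copy the profiling result of an earlier simulation into a separate workspace variable before you rerun the simulation. The variable names in this sketch are illustrative:

% Run the baseline simulation and keep its profiling result.
simOutBaseline = sim(gcs);
baselineProfile = simOutBaseline.executionProfile;

% Modify the model or the metric, then rerun the simulation.
simOut = sim(gcs);

% Open the current results in the Code Profile Analyzer, then select
% baselineProfile from the "Select a profile variable" list and click Compare.
coder.profile.show(simOut.executionProfile)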

Register Custom Driver to Extract PMU Metrics

If your target hardware is not supported natively, you can register custom drivers to extract the required metrics from the PMU. Provided you define the API, you can extract metrics from the PMU of any target hardware. You are not limited to Linux-based target hardware or the PAPI library.

The Add profiling timer step in Use rtiostream API for PIL Target Connectivity shows how you can create a timer object for time retrieval. You can modify the step to enable the use of PMU metrics in code profiling.

Create a hardware counter object that describes the retrieval of counter information from the PMU. The object must specify the function that retrieves counter values and provide its implementation.

This example uses a C function, pmu_counter, which returns a counter value as a uint64 data type.

HWCounterSignature = target.create('Function');
HWCounterSignature.Name = 'pmu_counter';
HWCounterSignature.ReturnType = 'uint64';

Capture the function in an API object.

HWCounterApi = target.create('API');
HWCounterApi.Functions = HWCounterSignature;
HWCounterApi.Language = target.Language.C;
HWCounterApi.Name = 'Hardware Counter API';

Capture the dependencies of the function, that is, the source and header files that are required to run the function.

HWCounterDependencies = target.create('BuildDependencies');
HWCounterDependencies.IncludeFiles = {'HWCounter.h'};
HWCounterDependencies.IncludePaths = {'/home/user/Desktop/PMUDrivers'};
HWCounterDependencies.SourceFiles = {'HWCounter.c'};

Create an object that combines the API and dependencies.

HWCounterImplementation = target.create('APIImplementation');
HWCounterImplementation.API = HWCounterApi;
HWCounterImplementation.BuildDependencies = HWCounterDependencies;
HWCounterImplementation.Name = 'Hardware Counter Implementation';

Create the hardware counter object and associate it with the L1 data cache miss counter.

HWCounter = target.create('HardwareCounter');
HWCounter.APIImplementation = HWCounterImplementation; 
HWCounter.Name = 'Hardware Counter';
HWCounter.Type = target.HardwareCounterType.L1DCMiss;

Append the new hardware counter to the list of timers and counters in the processor object, that is, the processor description that you created or retrieved in the referenced connectivity workflow.

processor.Timers = [processor.Timers HWCounter];
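
If that processor object is not already in your workspace, you can retrieve it by name by using target.get. The processor name in this sketch is illustrative:

% Retrieve the processor description by name ("My Processor" is an
% illustrative name) and register the new counter alongside existing timers.
processor = target.get('Processor', 'My Processor');
processor.Timers = [processor.Timers HWCounter];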

The software supports these hardware counter types.

Hardware Counter Type                     Metric Extracted from PMU
target.HardwareCounterType.NumInstrs      Total instruction count
target.HardwareCounterType.FPUInstrs      FPU instruction count
target.HardwareCounterType.FPUOpts        FPU operation count
target.HardwareCounterType.ALUInstrs      ALU instruction count
target.HardwareCounterType.LdInstrs       Load instruction count
target.HardwareCounterType.StInstrs       Store instruction count
target.HardwareCounterType.TotCycles      Total cycle count
target.HardwareCounterType.L1DCMiss       L1 data cache misses
target.HardwareCounterType.L1ICMiss       L1 instruction cache misses
target.HardwareCounterType.L2DCMiss       L2 data cache misses
target.HardwareCounterType.L2ICMiss       L2 instruction cache misses
target.HardwareCounterType.DTLBMiss       TLB data misses
target.HardwareCounterType.ITLBMiss       TLB instruction misses
target.HardwareCounterType.MemStalls      Memory stall cycles
target.HardwareCounterType.TotStalls      Total stall cycles
