Cache hit-rate HOWTO
====================

A   Introduction

    The ARM Fast Models are accompanied by a trace infrastructure
    referred to as the Model Trace Interface (MTI). MTI provides a
    mechanism to register dynamically for events from the model. The
    GenericTrace.so MTI trace plugin provides a number of trace events
    whose output can be logged to a simple text file. The usage of
    this plugin is described in Section B.

    This document describes how the GenericTrace.so plugin can be
    used during a cluster switchover to count the hits in the
    outbound cluster's L2 cache that are caused by snoops originating
    from the inbound cluster, before the outbound L2 is flushed and
    the cluster is placed in reset.

B   Plugin Usage

    The GenericTrace plugin is loaded by passing the "--trace-plugin"
    option on the command line used to launch the model.

    The trace sources provided by the plugin can be listed as
    follows:

    "RTSM_VE_Cortex-A15x1-A7x1 --trace-plugin GenericTrace.so
     --parameter TRACE.GenericTrace.trace-sources= "

    The parameters supported by the GenericTrace plugin can be listed
    as follows:

    "RTSM_VE_Cortex-A15x1-A7x1 --trace-plugin GenericTrace.so -l"

    Some of the interesting parameters are:

    TRACE.GenericTrace.trace-file: The file to write the trace to. If
    empty, the trace is printed to the console (STDOUT).

    TRACE.GenericTrace.perf-period: Print a performance line every N
    instructions. Since the instruction count and the global counter
    have the same value on the Fast Models, this parameter provides
    a good approximation of time.

    TRACE.GenericTrace.flush: If set to true, the trace file is
    flushed after every event.

C   Plugin Trace sources

    The GenericTrace plugin provides events which allow each cluster
    to trace snoop requests that originate from a different cluster
    and hit in its caches. For snoops originating from the Cortex-A7
    cluster that hit in the A15 cluster, the event is
    'read_for_4_came_from_snoop'; for the opposite case, the event is
    'read_for_3_came_from_snoop'. The numbers '3' and '4' in the names
    of the trace sources are the ids of the CCI slave interfaces from
    which the snoops originated.

    These trace sources are the per-cluster implementation of event
    id 0xA ("Read data last handshake - data returned from the cache
    rather than from downstream") of the CCI PMU. Please refer to the
    "Cache Coherent Interconnect (CCI-400) Architecture Specification"
    for further details.

    The plugin can also trace code execution through a memory-mapped
    "tube" interface. This interface defines a set of registers; when
    they are written in a particular sequence, and the 'sw_trace_event'
    trace source was selected when the model was invoked, the register
    values are printed to the trace file.

    The "tube" interface defines:

    - Three little-endian 64-bit registers of arbitrary data that can
      be written (and retain their values).

    - A tube-like character register. Writing '\0' to it generates an
      event, with a unique sequence_id, that contains the current state
      of the 64-bit registers and the characters sent to the device.

    All of these registers are banked and write-only. The trace event
    also outputs the cluster id and the CPU id. ARM FastModels
    implement one to four TUBE interfaces; refer to Section E for the
    interfaces supported by the current model release. The memory map
    of these registers can be found in big-little/include/misc.h.

    The 'write_trace' function in big-little/lib/tube.c implements the
    software sequence that programs the tube interface. This function
    is called at various points in the switchover process. It prints a
    message indicating that an event is about to start or has
    completed, along with the value of the global counter in one of
    the 64-bit registers. To enable this functionality, the
    environment variable "TUBE" must be set to TRUE before the code is
    compiled.
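
    To make the register protocol concrete, the following is a minimal
    host-side model of a single tube interface, written in C. It is a
    sketch under stated assumptions, not the code in
    big-little/lib/tube.c: the names tube_model, tube_write_data,
    tube_write_char and tube_trace are hypothetical, per-CPU banking is
    not modelled, and on the target the writes would go to the
    volatile, write-only registers whose addresses are given in
    big-little/include/misc.h rather than to a C structure. The printed
    event is a simplified form of the 'sw_trace_event' line.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TUBE_MSG_MAX 128

/* Hypothetical host-side model of ONE tube interface (per-CPU banking is
 * not modelled). On the target these are write-only memory-mapped
 * registers; their real addresses are in big-little/include/misc.h. */
struct tube_model {
    uint64_t data[3];          /* the three 64-bit data registers */
    char msg[TUBE_MSG_MAX];    /* characters received since the last event */
    size_t len;                /* number of buffered characters */
    uint32_t sequence_id;      /* incremented for every event */
    char last_event[256];      /* text of the most recent event */
};

/* A write to data register n simply retains the value. */
void tube_write_data(struct tube_model *t, int n, uint64_t value)
{
    t->data[n] = value;
}

/* A write to the char register buffers the character; writing '\0'
 * emits an event with the current data registers and the message. */
void tube_write_char(struct tube_model *t, char c)
{
    if (c != '\0') {
        if (t->len < TUBE_MSG_MAX - 1)
            t->msg[t->len++] = c;   /* model choice: excess chars dropped */
        return;
    }
    t->msg[t->len] = '\0';
    t->sequence_id++;
    snprintf(t->last_event, sizeof t->last_event,
             "sequence_id=0x%08x data0=0x%016llx message=\"%s\"",
             (unsigned)t->sequence_id,
             (unsigned long long)t->data[0], t->msg);
    puts(t->last_event);
    t->len = 0;
}

/* Hypothetical software sequence: program the data registers first, then
 * the string, then the '\0' that triggers the event (the event captures
 * the data registers as they are when '\0' is written). */
void tube_trace(struct tube_model *t, const char *msg, uint64_t counter)
{
    tube_write_data(t, 0, counter);   /* data0: global counter value */
    tube_write_data(t, 1, 0);         /* data1: not used */
    tube_write_data(t, 2, 0);         /* data2: not used */
    for (const char *p = msg; *p != '\0'; p++)
        tube_write_char(t, *p);
    tube_write_char(t, '\0');
}
```

    For example, calling tube_trace() with "L2 Flush Begin" and then
    with "L2 Flush End" on a zero-initialised tube_model produces two
    events with sequence ids 1 and 2. The order of the writes matters:
    because '\0' snapshots the data registers, they must be programmed
    before the string is terminated.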
    
D   Putting it all together

    To use the functionality described above:
    
    1. Build the Virtualizer code with "TUBE" support. In the tcsh
       shell, this is done as follows:

       $ setenv TUBE TRUE; make clean && make

    2. Launch the model with the GenericTrace MTI plugin loaded and
       the required trace sources selected, using a suitable MXScript
       file in the 'bootwrapper' directory.

    Once the switchover process starts, the trace file will contain output
    that looks like this (not including the comments):
    
    .
    .
    .
    .
    // Lines beginning with "PERFORMANCE" are produced according to the
    // value of the "TRACE.GenericTrace.perf-period" parameter. This
    // string is printed roughly every <value> instructions (200 in
    // this case) in the trace file. It indicates the rate at which the
    // model is executing instructions and the number of instructions
    // executed so far.
    PERFORMANCE:   2.8 MIPS (Inst:67216767)
    .
    .
    .   
    // Lines beginning with "sw_trace_event<x>" are a result of enabling
    // "TUBE" support in the code and selecting the "sw_trace_event" source
    // while invoking the model. The fields of this message are:
    //
    // <x>                : the "TUBE" interface number (no suffix for
    //                      TUBE0, see Section E).
    // sequence_id        : a unique number assigned to each message.
    // cluster_and_cpu_id : in the format 0x<cluster id><cpu id>. Each id
    //                      occupies 8 bits.
    // data0              : first 64-bit register value. Programmed with
    //                      the value of the global counter.
    // data1              : second 64-bit register value. Not used.
    // data2              : third 64-bit register value. Not used.
    // message            : the string written to the TUBE register. It
    //                      is followed by ":<n>"; in these examples <n>
    //                      equals the string length plus the
    //                      terminating '\0'.
    sw_trace_event2: sequence_id=0x00000001 cluster_and_cpu_id=0x0000 data0=0x000000000401a3dc data1=0x0000000000000000 data2=0x0000000000000000 message="Secure Coherency Enable Start":30
    .
    .
    .
    PERFORMANCE:   0.2 MIPS (Inst:67217079)
    sw_trace_event2: sequence_id=0x00000002 cluster_and_cpu_id=0x0000 data0=0x000000000401a581 data1=0x0000000000000000 data2=0x0000000000000000 message="Secure Coherency Enable End":28
    PERFORMANCE:   0.9 MIPS (Inst:67217301)
    PERFORMANCE:   5.8 MIPS (Inst:67217511)
    .
    .
    .
    // Lines beginning with "read_for_<x>_came_from_snoop" are a result of
    // enabling the event sources that monitor cache hits resulting from
    // snoops originating from slave interface <x> on the CCI. The
    // following lines indicate that snoops from the Cortex-A7 cluster
    // hit in the caches of the A15 cluster. Each line also gives the
    // cache line address and whether the access was Secure or
    // Non-secure.
    read_for_4_came_from_snoop: Bus address=0x000000008ff02440 Is non secure=N
    read_for_4_came_from_snoop: Bus address=0x000000008ff02440 Is non secure=N
    read_for_4_came_from_snoop: Bus address=0x000000008ff02240 Is non secure=N
    read_for_4_came_from_snoop: Bus address=0x000000008ff02240 Is non secure=N
    read_for_4_came_from_snoop: Bus address=0x000000008ff012c0 Is non secure=N
    PERFORMANCE:   0.0 MIPS (Inst:135292834)
    sw_trace_event: sequence_id=0x00000010 cluster_and_cpu_id=0x0000 data0=0x000000000810672e data1=0x0000000000000000 data2=0x0000000000000000 message="L2 Flush Begin":15
    PERFORMANCE:   5.5 MIPS (Inst:135293056)
    PERFORMANCE:   7.2 MIPS (Inst:135293374)
    PERFORMANCE:   7.4 MIPS (Inst:135293587)
    PERFORMANCE:  12.4 MIPS (Inst:135293800)
    PERFORMANCE:  10.0 MIPS (Inst:135294118)
    read_for_4_came_from_snoop: Bus address=0x0000000080054a80 Is non secure=Y
    read_for_4_came_from_snoop: Bus address=0x0000000080054a80 Is non secure=Y
    read_for_4_came_from_snoop: Bus address=0x0000000080054ac0 Is non secure=Y
    read_for_4_came_from_snoop: Bus address=0x0000000080054ac0 Is non secure=Y
    read_for_4_came_from_snoop: Bus address=0x0000000080074c80 Is non secure=Y
    PERFORMANCE:   0.5 MIPS (Inst:135294331)
    .
    .
    .
    .
    PERFORMANCE:  10.5 MIPS (Inst:135541612)
    PERFORMANCE:   3.3 MIPS (Inst:135541929)
    sw_trace_event: sequence_id=0x00000011 cluster_and_cpu_id=0x0000 data0=0x0000000008143442 data1=0x0000000000000000 data2=0x0000000000000000 message="L2 Flush End":13
    .
    .
    .
    .
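
    The trace lines shown above have a regular format, so
    post-processing can start from a small line parser. The following C
    sketch assumes exactly the line formats in the excerpt; the names
    sw_event, parse_sw_event, parse_snoop_hit and cluster_hit are
    hypothetical. It treats a bare "sw_trace_event" prefix as TUBE0 (as
    stated in Section E), splits cluster_and_cpu_id into its two 8-bit
    fields, and maps the CCI slave interface number of a snoop-hit line
    to the cluster whose cache was hit (as described in Section C).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Fields of one "sw_trace_event<x>" line (hypothetical structure). */
struct sw_event {
    unsigned tube;               /* interface number; 0 when there is no suffix */
    unsigned sequence_id;
    unsigned cluster;            /* bits [15:8] of cluster_and_cpu_id */
    unsigned cpu;                /* bits [7:0] of cluster_and_cpu_id */
    unsigned long long data0;    /* global counter value */
    unsigned long long data1, data2;
    char message[128];
};

/* Parse a sw_trace_event line. Returns 0 on success, -1 otherwise. */
int parse_sw_event(const char *line, struct sw_event *ev)
{
    static const char prefix[] = "sw_trace_event";
    const size_t n = sizeof prefix - 1;
    const char *colon;
    unsigned id;

    if (strncmp(line, prefix, n) != 0)
        return -1;
    colon = strchr(line + n, ':');
    if (colon == NULL)
        return -1;

    /* "sw_trace_event:" is TUBE0; "sw_trace_event2:" is TUBE2, etc. */
    ev->tube = 0;
    if (colon != line + n) {
        char *end;
        ev->tube = (unsigned)strtoul(line + n, &end, 10);
        if (end != colon)
            return -1;           /* something other than digits before ':' */
    }

    /* %x and %llx accept the "0x" prefix used in the trace. */
    if (sscanf(colon + 1,
               " sequence_id=%x cluster_and_cpu_id=%x data0=%llx"
               " data1=%llx data2=%llx message=\"%127[^\"]\"",
               &ev->sequence_id, &id, &ev->data0,
               &ev->data1, &ev->data2, ev->message) != 6)
        return -1;

    ev->cluster = (id >> 8) & 0xffu;
    ev->cpu = id & 0xffu;
    return 0;
}

/* Parse a "read_for_<x>_came_from_snoop" line. Returns 0 on success. */
int parse_snoop_hit(const char *line, int *iface,
                    unsigned long long *addr, int *non_secure)
{
    char ns;

    if (sscanf(line, "read_for_%d_came_from_snoop: Bus address=%llx"
                     " Is non secure=%c", iface, addr, &ns) != 3)
        return -1;
    *non_secure = (ns == 'Y');
    return 0;
}

/* Section C: interface 4 is the Cortex-A7 side (hit in the A15 caches),
 * interface 3 is the Cortex-A15 side (hit in the A7 caches). */
const char *cluster_hit(int iface)
{
    return iface == 4 ? "A15" : iface == 3 ? "A7" : "unknown";
}
```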

    Post-processing scripts can be written to count the number of
    'read_for_<x>_came_from_snoop' events between two 'sw_trace_event<x>'
    events. In the above example, the result is the number of snoop
    hits in the A15 caches while they were being flushed. In addition,
    the "PERFORMANCE" lines can be used to determine the cache hit
    rate: the number of hit events between two consecutive
    "PERFORMANCE" lines is the number of hits in roughly the last 200
    instructions. The experiment can be repeated, with each iteration
    changing the point in the switchover at which the L2 cache is
    flushed. By monitoring the effect on the cache hit rate, a suitable
    time to power down the outbound L2 cache can be determined.
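
    The counting described above could look like the following C
    sketch. It is not part of the Virtualizer sources: the names
    hit_counter, hit_counter_feed and hit_counter_rate are
    hypothetical, and the code assumes the line formats and the
    "L2 Flush Begin"/"L2 Flush End" marker messages shown in the
    example trace. It counts lines from one snoop-hit source between
    two markers, tracks the busiest window between consecutive
    PERFORMANCE lines, and records the global counter (data0) of both
    markers so that an average rate can be derived.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical state for counting one snoop-hit source between two
 * sw_trace_event markers. Zero-initialise, then set the three strings. */
struct hit_counter {
    const char *source;      /* e.g. "read_for_4_came_from_snoop" */
    const char *begin_msg;   /* e.g. "L2 Flush Begin" */
    const char *end_msg;     /* e.g. "L2 Flush End" */
    int inside;              /* 1 while between the two markers */
    int done;                /* 1 once the end marker has been seen */
    unsigned long hits;      /* total matching hits between the markers */
    unsigned long window_hits;      /* hits since the last PERFORMANCE line */
    unsigned long max_window_hits;  /* busiest perf-period window */
    unsigned long long t_begin, t_end;  /* data0 of the two markers */
};

/* 1 if line is a sw_trace_event<x> line whose message is exactly msg. */
static int is_marker(const char *line, const char *msg)
{
    const char *m;
    size_t len = strlen(msg);

    if (strncmp(line, "sw_trace_event", 14) != 0)
        return 0;
    m = strstr(line, "message=\"");
    return m != NULL && strncmp(m + 9, msg, len) == 0 && m[9 + len] == '"';
}

/* Global counter value from the data0 field (0 if absent). */
static unsigned long long data0_of(const char *line)
{
    const char *p = strstr(line, "data0=");
    return p != NULL ? strtoull(p + 6, NULL, 16) : 0;
}

/* Feed one trace line; a trailing newline from fgets() is harmless. */
void hit_counter_feed(struct hit_counter *hc, const char *line)
{
    size_t slen = strlen(hc->source);

    if (hc->done)
        return;
    if (!hc->inside) {
        if (is_marker(line, hc->begin_msg)) {
            hc->inside = 1;
            hc->t_begin = data0_of(line);
        }
        return;
    }
    if (is_marker(line, hc->end_msg)) {
        hc->inside = 0;
        hc->done = 1;
        hc->t_end = data0_of(line);
    }
    if (hc->done || strncmp(line, "PERFORMANCE:", 12) == 0) {
        /* Close the current perf-period window; the first and last
         * windows are partial because they start or end at a marker. */
        if (hc->window_hits > hc->max_window_hits)
            hc->max_window_hits = hc->window_hits;
        hc->window_hits = 0;
        return;
    }
    if (strncmp(line, hc->source, slen) == 0 && line[slen] == ':') {
        hc->hits++;
        hc->window_hits++;
    }
}

/* Average hits per `period` global-counter ticks between the markers. */
double hit_counter_rate(const struct hit_counter *hc, unsigned long long period)
{
    unsigned long long elapsed = hc->t_end - hc->t_begin;
    return (hc->done && elapsed != 0) ? (double)hc->hits * period / elapsed : 0.0;
}
```

    A driver would read the trace file with fgets() and pass each line
    to hit_counter_feed(). With the data0 values of the two markers in
    the excerpt above, t_end - t_begin = 0x8143442 - 0x810672e = 249108
    counter ticks, which agrees closely with the instruction counts in
    the neighbouring PERFORMANCE lines (135292834 and 135541929), as
    expected when the global counter tracks the instruction count.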

E   Status of "TUBE" support

    The current version of ARM FastModels (RTSM VE Cortex-A15 KF
    CCI version MODEL_VERSION) implements only one 'tube'
    interface, i.e. TUBE0.
    
    Subsequent releases will support up to four 'tube' interfaces,
    i.e. TUBE0-3. The Virtualizer code has been internally tested with
    all four 'tube' interfaces and assumes that they are present.
    Writing to a non-existent 'tube' interface is treated as a no-op,
    so the trace file will contain messages only from the
    'sw_trace_event' source, i.e. TUBE0.
    
    (Please contact the ARM FastModels team for details of future ARM
    FastModels releases that will support all four tube interfaces.)