============
Crash Events
============

**Crash Events** refers to a special subsystem of Gecko that aims to capture
events of interest related to process crashing and hanging.

When an event worthy of recording occurs, a file containing that event's
information is written to a well-defined location on the filesystem. The Gecko
process periodically scans for produced files and consolidates information
into a more unified and efficient backend store.

Crash Event Files
=================

When a crash-related event occurs, a file describing that event is written
to a well-defined directory. That directory is likely inside the directory of
the currently active profile. However, if a profile is not yet active in
the Gecko process, that directory likely resides in the user's *app data*
directory (*UAppData* from the directory service).

The filename of an event file carries no meaning. However, producers need
to choose a filename intelligently to avoid name collisions and race
conditions. Since file locking is potentially dangerous at crash time,
the adopted convention is to generate a UUID and use it as the filename.
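
The naming convention can be illustrated with a minimal Python sketch. It is
not the actual producer code (producers such as the crash reporter are not
written in Python), and the function name ``write_event_file`` and its
parameters are hypothetical. It only shows how a random UUID filename lets
independent producers write files without coordinating or taking locks.

```python
import os
import uuid


def write_event_file(events_dir: str, contents: str) -> str:
    """Write `contents` to a new, uniquely named file and return its path.

    Hypothetical helper: `events_dir` stands in for the well-defined
    events directory described above.
    """
    # A random UUID makes collisions between producers practically impossible,
    # so no file locking is needed.
    path = os.path.join(events_dir, str(uuid.uuid4()))
    # Mode "x" creates the file and fails if it already exists;
    # newline="" writes "\n" unchanged on every platform.
    with open(path, "x", encoding="utf-8", newline="") as fh:
        fh.write(contents)
    return path
```

Each call creates exactly one new file and never modifies an existing one.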

File Format
-----------

All crash event files share the same high-level file format. The format
consists of the following fields delimited by a UNIX newline (``\n``)
character:

* String event name (valid UTF-8, but likely ASCII)
* String representation of integer seconds since UNIX epoch
* Payload

The payload is event specific and may contain UNIX newline characters.
The recommended method for parsing is to split on UNIX newline into at most
three fields and then dispatch to an event-specific parser based on the
event name.
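
As a rough illustration of the recommended parsing approach, the following
Python sketch splits a file's contents into the three fields. The
``CrashEvent`` type and ``parse_event`` function are hypothetical names
introduced for this example, not part of any Gecko API.

```python
from typing import NamedTuple


class CrashEvent(NamedTuple):
    name: str
    timestamp: int  # integer seconds since UNIX epoch
    payload: str


def parse_event(data: str) -> CrashEvent:
    """Split raw event file contents into name, timestamp and payload."""
    # maxsplit=2 yields at most three fields, so any newlines inside the
    # payload are left untouched.
    parts = data.split("\n", 2)
    if len(parts) != 3:
        raise ValueError("expected event name, timestamp and payload")
    name, timestamp, payload = parts
    return CrashEvent(name, int(timestamp), payload)
```

Limiting the number of splits is what allows the payload to contain newlines
without any escaping.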

If an unknown event type is encountered, the event can safely be ignored
until later. This helps ensure that application downgrades (potentially
due to elevated crash rate) don't result in data loss.
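
One way to picture this behavior is a dispatch table keyed by event name.
The Python sketch below is illustrative only: ``HANDLERS``,
``handle_crash_id`` and ``process_event`` are hypothetical names, and it
assumes that "ignoring until later" means the caller leaves the file on disk
whenever ``None`` is returned.

```python
def handle_crash_id(payload):
    """Handler for events whose payload is a crash ID."""
    return {"id": payload.strip()}


# Event types this (hypothetical) version of the application understands.
HANDLERS = {
    "crash.main.1": handle_crash_id,
    "crash.plugin.1": handle_crash_id,
    "hang.plugin.1": handle_crash_id,
}


def process_event(data):
    """Return a parsed record, or None if the event type is unknown."""
    name, _, rest = data.partition("\n")
    handler = HANDLERS.get(name)
    if handler is None:
        # Unknown type: report nothing so the caller can keep the file
        # for a newer application version to process.
        return None
    timestamp, _, payload = rest.partition("\n")
    record = handler(payload)
    record["type"] = name
    record["timestamp"] = int(timestamp)
    return record
```

With this shape, an older build that encounters an event written by a newer
build skips it without discarding the data.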

The format and semantics of each event type are meant to be constant once
that event type is committed to the main Firefox repository. If new metadata
needs to be captured or the meaning of data captured in an event changes,
that change should be expressed through the invention of a new event type.
For this reason, it is highly recommended that event names contain a version.
For example, instead of a *Gecko process crashed* event, we prefer a
*Gecko process crashed v1* event.

Event Types
-----------

The subsections below document the types of crash events that may be
produced. Each subsection name corresponds to the first line of the crash
event file.

crash.main.1
^^^^^^^^^^^^

This event is produced when the main process crashes.

The payload of this event is the string crash ID, very likely a UUID.
There should be ``UUID.dmp`` and ``UUID.extra`` files on disk, saved by
Breakpad.
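
For illustration, the companion file names can be derived from the payload as
in the short Python sketch below. The ``dump_dir`` parameter is an
assumption: this document does not say where Breakpad saves these files.

```python
import os


def minidump_paths(dump_dir, crash_id):
    """Return the expected (.dmp, .extra) paths for a crash ID.

    `dump_dir` is a hypothetical directory supplied by the caller.
    """
    base = os.path.join(dump_dir, crash_id)
    return base + ".dmp", base + ".extra"
```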

crash.plugin.1
^^^^^^^^^^^^^^

This event is produced when a plugin process crashes.

The payload is identical to ``crash.main.1``'s.

hang.plugin.1
^^^^^^^^^^^^^

This event is produced when a plugin process hangs.

The payload is identical to ``crash.main.1``'s.

Aggregated Event Log
====================

Crash events are aggregated together into a unified event *log*. Currently,
this *log* is really a JSON file. However, this is an implementation detail
and it could change at any time. The interface to crash data provided by
the JavaScript API is the only supported interface.

Design Considerations
=====================

There are many considerations influencing the design of this subsystem.
We attempt to document them in this section.

Decoupling of Event Files from Final Data Structure
---------------------------------------------------

While it is certainly possible for the Gecko process to write directly to
the final data structure on disk, there is an intentional decoupling between
the production of events and their transition into final storage. In the
same vein, the choice to have events written to multiple files by producers
is deliberate.

Some recorded events are written immediately after a process crash. This is
a very uncertain time for the host system. There is a high likelihood the
system is in an exceptional state, such as memory exhaustion. Therefore, any
action taken after crashing needs to be very deliberate about what it does.
Excessive memory allocation and certain system calls may cause the system
to crash again or the machine's condition to worsen. This means that the act
of recording a crash event must be very lightweight. Writing a new file from
nothing is very lightweight. This is one reason we write separate files.

Another reason we write separate files is that if the main Gecko process
itself crashes (as opposed to, say, a plugin process), the crash reporter (not
Gecko) is running and the crash reporter needs to handle the writing of the
event info. If this writing is involved (say, loading, parsing, updating, and
reserializing back to disk), this logic would need to be implemented in both
Gecko and the crash reporter, or would need to be implemented in such a way
that both could use it. Neither of these is very practical from a software
lifecycle management perspective. It's much easier to have separate processes
write a simple file and to let a single implementation do all the complex
work.

Idempotent Event Processing
===========================

Processing of event files has been designed such that the result is
idempotent regardless of what order those files are processed in. This is
not only a good design decision, but it is arguably necessary. While event
files are processed in order by file mtime, filesystem times may not have
the resolution required for proper sorting. Therefore, processing order is
merely an optimistic assumption.
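
The property can be sketched in Python as follows. This is not the actual
aggregation code. It assumes, for illustration, that records are keyed by
event type and crash ID and that only the earliest timestamp is kept for each
key; the ``aggregate`` function and the tuple layout are hypothetical.

```python
def aggregate(store, events):
    """Fold (mtime, name, timestamp, crash_id) tuples into `store`.

    `store` maps (name, crash_id) to the earliest timestamp seen.
    """
    # Sorting by mtime is only best effort: coarse filesystem timestamps
    # can tie, so the update below must not depend on the order.
    for _mtime, name, timestamp, crash_id in sorted(events, key=lambda e: e[0]):
        key = (name, crash_id)
        previous = store.get(key)
        # Keyed insert plus min() is commutative and idempotent: repeated
        # or reordered events leave the store in the same state.
        store[key] = timestamp if previous is None else min(previous, timestamp)
    return store
```

Because the update for each key does not depend on what was processed before
it, replaying the same files or processing them in a different order produces
the same store.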

Aggregated Storage Format
=========================

Crash events are aggregated into a unified data structure on disk. That data
structure is currently LZ4-compressed JSON and is represented by a single file.

The choice of a single JSON file was initially driven by time and complexity
concerns. Before changing the format or adding significant amounts of new
data, some considerations must be taken into account.

First, in well-behaving installs, crash data should be minimal. Crashes and
hangs will be rare and thus the size of the crash data should remain small
over time.

Second, the choice of a single JSON file has larger implications as the
amount of crash data grows. As new data is accumulated, we need to read and
write an entire file to make small updates. LZ4 compression helps reduce I/O,
but there is still a potential for unbounded file growth. To bound it, we
establish a limit for the max age of records. Anything older than that limit
is pruned. We also establish a daily limit on the number of crashes we will
store. All crashes beyond the first N in a day have no payload and are
only recorded by the presence of a count. This count ensures we can
distinguish between ``N`` and ``100 * N``, which are very different
values!
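
The two limits can be pictured with the Python sketch below. It is a
simplified model rather than the real storage code: the per-day bucket
layout and the names ``add_crash``, ``prune``, ``max_per_day`` and
``max_age_days`` are assumptions made for this example.

```python
SECONDS_PER_DAY = 86400


def add_crash(store, crash_id, timestamp, max_per_day):
    """Record a crash, keeping details only for the first N per day."""
    day = timestamp // SECONDS_PER_DAY
    bucket = store.setdefault(day, {"count": 0, "crashes": {}})
    if crash_id in bucket["crashes"]:
        return  # already stored with its details
    # Every crash is counted, even when its details are dropped.
    bucket["count"] += 1
    if len(bucket["crashes"]) < max_per_day:
        bucket["crashes"][crash_id] = timestamp


def prune(store, now, max_age_days):
    """Drop day buckets older than the maximum record age."""
    cutoff = now // SECONDS_PER_DAY - max_age_days
    for day in [d for d in store if d < cutoff]:
        del store[day]
```

Under this model a day with five crashes and a limit of two stores two
detailed records and a count of five, so the size of the file stays bounded
while the true crash volume remains visible.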