--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/toolkit/components/crashes/docs/crash-events.rst Wed Dec 31 06:09:35 2014 +0100
@@ -0,0 +1,162 @@
+============
+Crash Events
+============
+
+**Crash Events** refers to a special subsystem of Gecko that aims to capture
+events of interest related to process crashing and hanging.
+
+When an event worthy of recording occurs, a file containing that event's
+information is written to a well-defined location on the filesystem. The Gecko
+process periodically scans for produced files and consolidates information
+into a more unified and efficient backend store.
+
+Crash Event Files
+=================
+
+When a crash-related event occurs, a file describing that event is written
+to a well-defined directory. That directory is likely in the directory of
+the currently-active profile. However, if a profile is not yet active in
+the Gecko process, that directory likely resides in the user's *app data*
+directory (*UAppData* from the directory service).
+
+The filename of the event file is not relevant. However, producers need
+to choose a filename intelligently to avoid name collisions and race
+conditions. Since file locking is potentially dangerous at crash time,
+the convention of generating a UUID and using it as a filename has been
+adopted.
+
+File Format
+-----------
+
+All crash event files share the same high-level file format. The format
+consists of the following fields delimited by a UNIX newline (*\n*)
+character:
+
+* String event name (valid UTF-8, but likely ASCII)
+* String representation of integer seconds since UNIX epoch
+* Payload
+
+The payload is event specific and may contain UNIX newline characters.
+The recommended method for parsing is to split into at most 3 fields on UNIX
+newline and then dispatch to an event-specific parser based on the
+event name.
+
+If an unknown event type is encountered, the event can safely be ignored
+until later. This helps ensure that application downgrades (potentially
+due to elevated crash rate) don't result in data loss.
+
+The format and semantics of each event type are meant to be constant once
+that event type is committed to the main Firefox repository. If new metadata
+needs to be captured or the meaning of data captured in an event changes,
+that change should be expressed through the invention of a new event type.
+For this reason, event names are highly recommended to contain a version.
+For example, instead of a *Gecko process crashed* event, we prefer a
+*Gecko process crashed v1* event.
+
+Event Types
+-----------
+
+The subsections below document the types of crash events that may be
+produced. Each subsection name corresponds to the first line of the crash
+event file.
+
+crash.main.1
+^^^^^^^^^^^^
+
+This event is produced when the main process crashes.
+
+The payload of this event is the string crash ID, very likely a UUID.
+There should be ``UUID.dmp`` and ``UUID.extra`` files on disk, saved by
+Breakpad.
+
+crash.plugin.1
+^^^^^^^^^^^^^^
+
+This event is produced when a plugin process crashes.
+
+The payload is identical to ``crash.main.1``'s.
+
+hang.plugin.1
+^^^^^^^^^^^^^
+
+This event is produced when a plugin process hangs.
+
+The payload is identical to ``crash.main.1``'s.
+
+Aggregated Event Log
+====================
+
+Crash events are aggregated together into a unified event *log*. Currently,
+this *log* is really a JSON file. However, this is an implementation detail
+and it could change at any time. The interface to crash data provided by
+the JavaScript API is the only supported interface.
+
+Design Considerations
+=====================
+
+There are many considerations influencing the design of this subsystem.
+We attempt to document them in this section.
+
+Decoupling of Event Files from Final Data Structure
+---------------------------------------------------
+
+While it is certainly possible for the Gecko process to write directly to
+the final data structure on disk, there is an intentional decoupling between
+the production of events and their transition into final storage. In the
+same vein, the choice to have events written to multiple files by producers
+is deliberate.
+
+Some recorded events are written immediately after a process crash. This is
+a very uncertain time for the host system. There is a high likelihood that the
+system is in an exceptional state, such as memory exhaustion. Therefore, any
+action taken after crashing needs to be very deliberate about what it does.
+Excessive memory allocation and certain system calls may cause the system
+to crash again or the machine's condition to worsen. This means that the act
+of recording a crash event must be very lightweight. Writing a new file from
+nothing is very lightweight. This is one reason we write separate files.
+
+Another reason we write separate files is that if the main Gecko process
+itself crashes (as opposed to, say, a plugin process), the crash reporter (not
+Gecko) is running and the crash reporter needs to handle the writing of the
+event info. If this writing were involved (say, loading, parsing, updating, and
+reserializing back to disk), this logic would need to be implemented in both
+Gecko and the crash reporter or would need to be implemented in such a way
+that both could use it. Neither of these is very practical from a software
+lifecycle management perspective. It's much easier to have separate processes
+write a simple file and to let a single implementation do all the complex
+work.
+
+Idempotent Event Processing
+===========================
+
+Processing of event files has been designed such that the result is
+idempotent regardless of what order those files are processed in. This is
+not only a good design decision, but it is arguably necessary. While event
+files are processed in order by file mtime, filesystem times may not have
+the resolution required for proper sorting. Therefore, processing order is
+merely an optimistic assumption.
+
+Aggregated Storage Format
+=========================
+
+Crash events are aggregated into a unified data structure on disk. That data
+structure is currently LZ4-compressed JSON and is represented by a single file.
+
+The choice of a single JSON file was initially driven by time and complexity
+concerns. Before changing the format or adding significant amounts of new
+data, some considerations must be taken into account.
+
+First, in well-behaving installs, crash data should be minimal. Crashes and
+hangs will be rare and thus the size of the crash data should remain small
+over time.
+
+The choice of a single JSON file has larger implications as the amount of
+crash data grows. As new data is accumulated, we need to read and write
+an entire file to make small updates. LZ4 compression helps reduce I/O.
+However, there is a potential for unbounded file growth. We establish a
+limit for the maximum age of records. Anything older than that limit is
+pruned. We also establish a daily limit on the number of crashes we will
+store. All crashes beyond the first ``N`` in a day have no payload and are
+only recorded by the presence of a count. This count ensures we can
+distinguish between ``N`` and ``100 * N``, which are very different
+values!
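
The age limit and daily cap described in the final paragraph of the patch can be illustrated with a minimal Python sketch. It is not the Gecko implementation. The names ``MAX_AGE``, ``MAX_PER_DAY`` and ``aggregate`` are hypothetical, and the limit values are assumptions, since the patch does not state the maximum age or ``N``.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Assumed limits for illustration only; the patch does not give real values.
MAX_AGE = timedelta(days=180)
MAX_PER_DAY = 10  # the "N" in the text above


def aggregate(events, now):
    """Group (epoch_seconds, payload) pairs by UTC day.

    Events older than MAX_AGE are pruned. Each day keeps at most
    MAX_PER_DAY payloads, but its count covers every event seen that day.
    """
    cutoff = now - MAX_AGE
    days = defaultdict(lambda: {"count": 0, "payloads": []})
    for seconds, payload in sorted(events):
        when = datetime.fromtimestamp(seconds, tz=timezone.utc)
        if when < cutoff:
            continue  # older than the age limit: pruned
        day = days[when.date().isoformat()]
        day["count"] += 1
        if len(day["payloads"]) < MAX_PER_DAY:
            day["payloads"].append(payload)  # beyond the cap, only the count grows
    return dict(days)
```

With this shape, a day with 25 crashes stores 10 payloads and a count of 25, so a reader of the store can still tell a bad day from a catastrophic one.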