toolkit/components/crashes/docs/crash-events.rst

changeset 0
6474c204b198
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/toolkit/components/crashes/docs/crash-events.rst	Wed Dec 31 06:09:35 2014 +0100
     1.3 @@ -0,0 +1,162 @@
     1.4 +============
     1.5 +Crash Events
     1.6 +============
     1.7 +
     1.8 +**Crash Events** refers to a special subsystem of Gecko that aims to capture
     1.9 +events of interest related to process crashing and hanging.
    1.10 +
    1.11 +When an event worthy of recording occurs, a file containing that event's
    1.12 +information is written to a well-defined location on the filesystem. The Gecko
    1.13 +process periodically scans for produced files and consolidates information
    1.14 +into a more unified and efficient backend store.
    1.15 +
    1.16 +Crash Event Files
    1.17 +=================
    1.18 +
    1.19 +When a crash-related event occurs, a file describing that event is written
    1.20 +to a well-defined directory. That directory is likely in the directory of
    1.21 +the currently-active profile. However, if a profile is not yet active in
    1.22 +the Gecko process, that directory likely resides in the user's *app data*
    1.23 +directory (*UAppData* from the directory service).
    1.24 +
    1.25 +The filename of the event file is not relevant. However, producers need
    1.26 +to choose a filename intelligently to avoid name collisions and race
    1.27 +conditions. Since file locking is potentially dangerous at crash time,
    1.28 +the convention of generating a UUID and using it as a filename has been
    1.29 +adopted.
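
The UUID naming convention described above can be sketched as follows. This is a minimal illustration rather than Gecko's actual implementation: the function name ``write_event_file``, its parameters, and the demo values are hypothetical, and the sketch assumes the file carries no trailing newline after the payload.

```python
import os
import tempfile
import time
import uuid


def write_event_file(events_dir, event_name, payload, now=None):
    """Write one crash event file and return its path (hypothetical helper).

    The filename is a freshly generated UUID, so independent producers are
    very unlikely to collide and no file locking is needed.
    """
    timestamp = int(time.time() if now is None else now)
    path = os.path.join(events_dir, str(uuid.uuid4()))
    # Mode "x" refuses to overwrite an existing file; newline="\n" keeps
    # UNIX newlines on every platform.
    with open(path, "x", encoding="utf-8", newline="\n") as f:
        f.write(f"{event_name}\n{timestamp}\n{payload}")
    return path


# Demo with made-up values, written to a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_event_file(
    demo_dir, "crash.main.1", "hypothetical-crash-id", now=1419998975
)
```

Creating a brand-new file keeps the work done at crash time small, which matches the rationale given later under Design Considerations.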
    1.30 +
    1.31 +File Format
    1.32 +-----------
    1.33 +
    1.34 +All crash event files share the same high-level file format. The format
    1.35 +consists of the following fields delimited by a UNIX newline (*\n*)
    1.36 +character:
    1.37 +
    1.38 +* String event name (valid UTF-8, but likely ASCII)
    1.39 +* String representation of integer seconds since UNIX epoch
    1.40 +* Payload
    1.41 +
    1.42 +The payload is event specific and may contain UNIX newline characters.
    1.43 +The recommended method for parsing is to split on the first two UNIX
    1.44 +newlines only, yielding at most 3 fields, and then dispatch to an
    1.45 +event-specific parser based on the event name.
    1.46 +
    1.47 +If an unknown event type is encountered, the event can safely be ignored
    1.48 +until later. This helps ensure that application downgrades (potentially
    1.49 +due to elevated crash rate) don't result in data loss.
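
The parsing and unknown-event rules above could look roughly like this. It is a sketch under stated assumptions: ``parse_event``, ``handle_event``, and the ``PAYLOAD_PARSERS`` table are hypothetical names, and the payload parsers simply wrap the crash ID string described in the Event Types section.

```python
def parse_event(data):
    """Split raw event-file bytes into (name, timestamp, payload)."""
    text = data.decode("utf-8")
    # Split on the first two newlines only, so the payload keeps any
    # newlines of its own.
    fields = text.split("\n", 2)
    if len(fields) != 3:
        raise ValueError("malformed crash event file")
    name, seconds, payload = fields
    return name, int(seconds), payload


# Hypothetical dispatch table: event name -> payload parser.
PAYLOAD_PARSERS = {
    "crash.main.1": lambda payload: {"crash_id": payload},
    "crash.plugin.1": lambda payload: {"crash_id": payload},
    "hang.plugin.1": lambda payload: {"crash_id": payload},
}


def handle_event(data):
    """Return the parsed event, or None when the event type is unknown."""
    name, timestamp, payload = parse_event(data)
    parser = PAYLOAD_PARSERS.get(name)
    if parser is None:
        # Unknown type: skip it for now so a later version can process it.
        return None
    return name, timestamp, parser(payload)
```

Returning ``None`` for unknown names, rather than deleting the file, is what lets a downgraded build leave newer events untouched.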
    1.50 +
    1.51 +The format and semantics of each event type are meant to be constant once
    1.52 +that event type is committed to the main Firefox repository. If new metadata
    1.53 +needs to be captured or the meaning of data captured in an event changes,
    1.54 +that change should be expressed through the invention of a new event type.
    1.55 +For this reason, event names are highly recommended to contain a version.
    1.56 +For example, instead of a *Gecko process crashed* event, we prefer a
    1.57 +*Gecko process crashed v1* event.
    1.58 +
    1.59 +Event Types
    1.60 +-----------
    1.61 +
    1.62 +The following subsections document the types of crash events that may be
    1.63 +produced. Each section name corresponds to the first line of the crash
    1.64 +event file.
    1.65 +
    1.66 +crash.main.1
    1.67 +^^^^^^^^^^^^
    1.68 +
    1.69 +This event is produced when the main process crashes.
    1.70 +
    1.71 +The payload of this event is the string crash ID, very likely a UUID.
    1.72 +There should be ``UUID.dmp`` and ``UUID.extra`` files on disk, saved by
    1.73 +Breakpad.
    1.74 +
    1.75 +crash.plugin.1
    1.76 +^^^^^^^^^^^^^^
    1.77 +
    1.78 +This event is produced when a plugin process crashes.
    1.79 +
    1.80 +The payload is identical to ``crash.main.1``'s.
    1.81 +
    1.82 +hang.plugin.1
    1.83 +^^^^^^^^^^^^^
    1.84 +
    1.85 +This event is produced when a plugin process hangs.
    1.86 +
    1.87 +The payload is identical to ``crash.main.1``'s.
    1.88 +
    1.89 +Aggregated Event Log
    1.90 +====================
    1.91 +
    1.92 +Crash events are aggregated together into a unified event *log*. Currently,
    1.93 +this *log* is really a JSON file. However, this is an implementation detail
    1.94 +and it could change at any time. The interface to crash data provided by
    1.95 +the JavaScript API is the only supported interface.
    1.96 +
    1.97 +Design Considerations
    1.98 +=====================
    1.99 +
   1.100 +There are many considerations influencing the design of this subsystem.
   1.101 +We attempt to document them in this section.
   1.102 +
   1.103 +Decoupling of Event Files from Final Data Structure
   1.104 +---------------------------------------------------
   1.105 +
   1.106 +While it is certainly possible for the Gecko process to write directly to
   1.107 +the final data structure on disk, there is an intentional decoupling between
   1.108 +the production of events and their transition into final storage. In the
   1.109 +same vein, the choice to have events written to multiple files by producers
   1.110 +is deliberate.
   1.111 +
   1.112 +Some recorded events are written immediately after a process crash. This is
   1.113 +a very uncertain time for the host system. There is a high likelihood the
   1.114 +system is in an exceptional state, such as memory exhaustion. Therefore, any
   1.115 +action taken after crashing needs to be very deliberate about what it does.
   1.116 +Excessive memory allocation and certain system calls may cause the system
   1.117 +to crash again or the machine's condition to worsen. This means that the act
   1.118 +of recording a crash event must be very lightweight. Writing a new file from
   1.119 +nothing is very lightweight. This is one reason we write separate files.
   1.120 +
   1.121 +Another reason we write separate files is that if the main Gecko process
   1.122 +itself crashes (as opposed to say a plugin process), the crash reporter (not
   1.123 +Gecko) is running and the crash reporter needs to handle the writing of the
   1.124 +event info. If this writing is involved (say loading, parsing, updating, and
   1.125 +reserializing back to disk), this logic would need to be implemented in both
   1.126 +Gecko and the crash reporter or would need to be implemented in such a way
   1.127 +that both could use it. Neither of these is very practical from a software
   1.128 +lifecycle management perspective. It's much easier to have separate processes
   1.129 +write a simple file and to let a single implementation do all the complex
   1.130 +work.
   1.131 +
   1.132 +Idempotent Event Processing
   1.133 +===========================
   1.134 +
   1.135 +Processing of event files has been designed such that the result is
   1.136 +idempotent regardless of what order those files are processed in. This is
   1.137 +not only a good design decision, but it is arguably necessary. While event
   1.138 +files are processed in order by file mtime, filesystem times may not have
   1.139 +the resolution required for proper sorting. Therefore, processing order is
   1.140 +merely an optimistic assumption.
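
One way to make processing insensitive to order is sketched below. It assumes, purely for illustration, that the aggregated store is keyed by crash ID and that conflicting records for the same ID are resolved by a fixed rule (keep the smallest ``(timestamp, name)`` pair). The source does not say how the real store achieves this; ``apply_event`` and ``process`` are hypothetical names.

```python
import itertools


def apply_event(store, name, timestamp, crash_id):
    """Merge one event into the store.

    Re-applying an event, or applying events in a different order, leaves
    the store unchanged because the merge keeps a minimum per crash ID.
    """
    candidate = (timestamp, name)
    current = store.get(crash_id)
    if current is None or candidate < current:
        store[crash_id] = candidate
    return store


def process(events):
    """Build a store from an iterable of (name, timestamp, crash_id)."""
    store = {}
    for name, timestamp, crash_id in events:
        apply_event(store, name, timestamp, crash_id)
    return store


# Made-up events; two of them share a crash ID on purpose.
demo_events = [
    ("crash.main.1", 1419998975, "id-a"),
    ("crash.plugin.1", 1419998975, "id-b"),
    ("hang.plugin.1", 1419998980, "id-b"),
]
demo_results = [process(p) for p in itertools.permutations(demo_events)]
```

Every permutation of ``demo_events`` produces the same store, so sorting by mtime becomes an optimization rather than a correctness requirement.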
   1.141 +
   1.142 +Aggregated Storage Format
   1.143 +=========================
   1.144 +
   1.145 +Crash events are aggregated into a unified data structure on disk. That data
   1.146 +structure is currently LZ4-compressed JSON and is represented by a single file.
   1.147 +
   1.148 +The choice of a single JSON file was initially driven by time and complexity
   1.149 +concerns. Before changing the format or adding significant amounts of new
   1.150 +data, some considerations must be taken into account.
   1.151 +
   1.152 +First, in well-behaving installs, crash data should be minimal. Crashes and
   1.153 +hangs will be rare and thus the size of the crash data should remain small
   1.154 +over time.
   1.155 +
   1.156 +The choice of a single JSON file has larger implications as the amount of
   1.157 +crash data grows. As new data is accumulated, we need to read and write
   1.158 +an entire file to make small updates. LZ4 compression helps reduce I/O.
   1.159 +However, unbounded file growth remains possible, so we establish a
   1.160 +limit for the max age of records. Anything older than that limit is
   1.161 +pruned. We also establish a daily limit on the number of crashes we will
   1.162 +store. All crashes beyond the first N in a day have no payload and are
   1.163 +only recorded by the presence of a count. This count ensures we can
   1.164 +distinguish between ``N`` and ``100 * N``, which are very different
   1.165 +values!
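
The two growth limits described above (maximum record age and a daily cap with an overflow count) can be sketched as follows. The function ``compact``, its parameters, and the day-bucketing by UTC day number are assumptions made for this illustration; the real limits and data layout are not given here.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400


def compact(records, now, max_age_days, max_per_day):
    """Apply the age limit and daily cap to (timestamp, payload) records.

    Returns (kept, overflow), where overflow maps a day number to the
    count of crashes beyond the first max_per_day on that day.
    """
    cutoff = now - max_age_days * SECONDS_PER_DAY
    kept = []
    stored_per_day = defaultdict(int)
    overflow = defaultdict(int)
    for timestamp, payload in sorted(records):
        if timestamp < cutoff:
            continue  # older than the max age: pruned entirely
        day = timestamp // SECONDS_PER_DAY
        if stored_per_day[day] < max_per_day:
            stored_per_day[day] += 1
            kept.append((timestamp, payload))
        else:
            overflow[day] += 1  # payload dropped, only counted
    return kept, dict(overflow)


# Made-up data: one stale record plus five crashes on day 9.
demo_records = [(0, "old")] + [
    (9 * SECONDS_PER_DAY + i, f"id{i}") for i in range(5)
]
demo_kept, demo_overflow = compact(
    demo_records, now=10 * SECONDS_PER_DAY, max_age_days=3, max_per_day=2
)
```

Keeping the overflow count preserves the difference between a mildly bad day and a catastrophic one even though the extra payloads are discarded.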
