============
Crash Events
============

**Crash Events** refers to a special subsystem of Gecko that aims to capture
events of interest related to process crashing and hanging.

When an event worthy of recording occurs, a file containing that event's
information is written to a well-defined location on the filesystem. The Gecko
process periodically scans for produced files and consolidates information
into a more unified and efficient backend store.

Crash Event Files
=================

When a crash-related event occurs, a file describing that event is written
to a well-defined directory. That directory is likely in the directory of
the currently-active profile. However, if a profile is not yet active in
the Gecko process, that directory likely resides in the user's *app data*
directory (*UAppData* from the directory service).

The filename of the event file is not relevant. However, producers need
to choose a filename intelligently to avoid name collisions and race
conditions. Since file locking is potentially dangerous at crash time,
the convention of generating a UUID and using it as a filename has been
adopted.

File Format
-----------

All crash event files share the same high-level file format. The format
consists of the following fields delimited by a UNIX newline (``\n``)
character:

* String event name (valid UTF-8, but likely ASCII)
* String representation of integer seconds since UNIX epoch
* Payload

The payload is event specific and may contain UNIX newline characters.
The recommended method for parsing is to split on the first two UNIX
newlines, yielding three fields, and then dispatch to an event-specific
parser based on the event name.

If an unknown event type is encountered, the event can safely be ignored
until later. This helps ensure that application downgrades (potentially
due to elevated crash rate) don't result in data loss.

The format and semantics of each event type are meant to be constant once
that event type is committed to the main Firefox repository. If new metadata
needs to be captured or the meaning of data captured in an event changes,
that change should be expressed through the invention of a new event type.
For this reason, it is highly recommended that event names contain a
version. For example, instead of a *Gecko process crashed* event, we prefer
a *Gecko process crashed v1* event.

Event Types
-----------

Each subsection documents one type of crash event that may be produced.
Each subsection name corresponds to the first line of the crash event file.

crash.main.1
^^^^^^^^^^^^

This event is produced when the main process crashes.

The payload of this event is the string crash ID, very likely a UUID.
There should be ``UUID.dmp`` and ``UUID.extra`` files on disk, saved by
Breakpad.

crash.plugin.1
^^^^^^^^^^^^^^

This event is produced when a plugin process crashes.

The payload is identical to ``crash.main.1``'s.

hang.plugin.1
^^^^^^^^^^^^^

This event is produced when a plugin process hangs.

The payload is identical to ``crash.main.1``'s.
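
The parsing approach recommended under File Format, applied to the event
types listed above, can be sketched in Python as follows. This is an
illustration only and not Gecko's actual implementation: the names
``parse_event``, ``parse_crash_id`` and ``PAYLOAD_PARSERS`` are hypothetical,
and stripping whitespace around the crash ID is an assumption.

```python
def parse_crash_id(payload):
    # For the event types documented above, the payload is the crash ID.
    # Stripping surrounding whitespace is an assumption of this sketch.
    return {"crash_id": payload.strip()}


# Hypothetical dispatch table keyed by event name (first line of the file).
PAYLOAD_PARSERS = {
    "crash.main.1": parse_crash_id,
    "crash.plugin.1": parse_crash_id,
    "hang.plugin.1": parse_crash_id,
}


def parse_event(text):
    """Parse the contents of one event file.

    Returns a dict for known event types and None for unknown ones, so the
    caller can leave the file in place for a newer build to process.
    Raises ValueError if the three fields are not present.
    """
    # Split on the first two newlines only; the payload may contain more.
    fields = text.split("\n", 2)
    if len(fields) != 3:
        raise ValueError("expected event name, timestamp and payload")
    name, seconds, payload = fields

    parser = PAYLOAD_PARSERS.get(name)
    if parser is None:
        return None  # unknown event type: ignore until later

    event = {"name": name, "timestamp": int(seconds)}
    event.update(parser(payload))
    return event
```

Returning ``None`` for an unrecognized name, rather than raising, mirrors the
rule that unknown events are ignored instead of being discarded.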

Aggregated Event Log
====================

Crash events are aggregated together into a unified event *log*. Currently,
this *log* is really a JSON file. However, this is an implementation detail
and it could change at any time. The interface to crash data provided by
the JavaScript API is the only supported interface.

Design Considerations
=====================

There are many considerations influencing the design of this subsystem.
We attempt to document them in this section.

Decoupling of Event Files from Final Data Structure
---------------------------------------------------

While it is certainly possible for the Gecko process to write directly to
the final data structure on disk, there is an intentional decoupling between
the production of events and their transition into final storage. In the
same vein, the choice to have events written to multiple files by producers
is deliberate.

Some recorded events are written immediately after a process crash. This is
a very uncertain time for the host system. There is a high likelihood the
system is in an exceptional state, such as memory exhaustion. Therefore, any
action taken after crashing needs to be very deliberate about what it does.
Excessive memory allocation and certain system calls may cause the system
to crash again or the machine's condition to worsen. This means that the act
of recording a crash event must be very lightweight. Writing a new file from
nothing is very lightweight. This is one reason we write separate files.
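
To make the producer side concrete, here is a minimal Python sketch of
writing one event to a fresh file named by a UUID. It only illustrates the
file layout and naming convention: the function ``write_event_file`` and its
parameters are hypothetical, and real crash-time code has far stricter
constraints on memory allocation and system calls than a Python script.

```python
import os
import time
import uuid


def write_event_file(events_dir, name, payload, now=None):
    """Write one event to a new, uniquely named file and return its path."""
    seconds = int(time.time() if now is None else now)

    # A random UUID as the filename avoids collisions without file locking.
    path = os.path.join(events_dir, str(uuid.uuid4()))

    # Mode "x" refuses to overwrite an existing file; newline="\n" keeps
    # UNIX newlines on every platform.
    with open(path, "x", encoding="utf-8", newline="\n") as f:
        f.write(f"{name}\n{seconds}\n{payload}")
    return path
```

The writer never reads or updates existing state. It only creates a new
file, which is what keeps the operation cheap.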

Another reason we write separate files is that if the main Gecko process
itself crashes (as opposed to, say, a plugin process), the crash reporter
(not Gecko) is running and the crash reporter needs to handle the writing of
the event info. If this writing is involved (say loading, parsing, updating,
and reserializing back to disk), this logic would need to be implemented in
both Gecko and the crash reporter or would need to be implemented in such a
way that both could use it. Neither of these is very practical from a
software lifecycle management perspective. It's much easier to have separate
processes write a simple file and to let a single implementation do all the
complex work.

Idempotent Event Processing
===========================

Processing of event files has been designed such that the result is
idempotent regardless of what order those files are processed in. This is
not only a good design decision, but it is arguably necessary. While event
files are processed in order by file mtime, filesystem times may not have
the resolution required for proper sorting. Therefore, processing order is
merely an optimistic assumption.

Aggregated Storage Format
=========================

Crash events are aggregated into a unified data structure on disk. That data
structure is currently LZ4-compressed JSON and is represented by a single file.

The choice of a single JSON file was initially driven by time and complexity
concerns. Before changing the format or adding significant amounts of new
data, some considerations must be taken into account.

First, in well-behaving installs, crash data should be minimal.
Crashes and hangs will be rare and thus the size of the crash data should
remain small over time.

The choice of a single JSON file has larger implications as the amount of
crash data grows. As new data is accumulated, we need to read and write
an entire file to make small updates. LZ4 compression helps reduce I/O.
However, there is still a potential for unbounded file growth. We establish
a limit for the max age of records. Anything older than that limit is
pruned. We also establish a daily limit on the number of crashes we will
store. All crashes beyond the first N in a day have no payload and are
only recorded by the presence of a count. This count ensures we can
distinguish between ``N`` and ``100 * N`` crashes in a day, which are very
different values!
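
The two limits described above can be sketched in Python as follows. This is
a simplified model, not the actual store: the function ``prune``, its
parameters, the ``(timestamp, payload)`` record shape, and bucketing days by
UTC are all assumptions made for illustration.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400


def prune(records, now, max_age_days, max_per_day):
    """Apply the age limit and the daily limit to a list of records.

    records is a list of (timestamp_seconds, payload) tuples. Returns a
    (kept, counts) pair: kept holds the records that retain their payload,
    and counts maps each day number to the total events seen that day.
    """
    cutoff = now - max_age_days * SECONDS_PER_DAY
    kept = []
    counts = defaultdict(int)

    for ts, payload in sorted(records, key=lambda r: r[0]):
        if ts < cutoff:
            continue  # older than the max age: pruned entirely
        day = ts // SECONDS_PER_DAY  # days since the epoch, in UTC
        counts[day] += 1
        if counts[day] <= max_per_day:
            kept.append((ts, payload))
        # Beyond the daily limit only the count records the event.

    return kept, dict(counts)
```

Keeping the per-day count even after payloads are dropped is what lets a
consumer tell a day with ``N`` crashes from a day with ``100 * N``.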