michael@0: ============
michael@0: Crash Events
michael@0: ============
michael@0: 
michael@0: **Crash Events** refers to a special subsystem of Gecko that aims to capture
michael@0: events of interest related to process crashing and hanging.
michael@0: 
michael@0: When an event worthy of recording occurs, a file containing that event's
michael@0: information is written to a well-defined location on the filesystem. The Gecko
michael@0: process periodically scans for produced files and consolidates information
michael@0: into a more unified and efficient backend store.
michael@0: 
michael@0: Crash Event Files
michael@0: =================
michael@0: 
michael@0: When a crash-related event occurs, a file describing that event is written
michael@0: to a well-defined directory. That directory normally lives inside the
michael@0: directory of the currently active profile. However, if no profile is active
michael@0: yet in the Gecko process, that directory normally resides in the user's
michael@0: *app data* directory (*UAppData* from the directory service).
michael@0: 
michael@0: The filename of the event file is not relevant. However, producers need
michael@0: to choose a filename intelligently to avoid name collisions and race
michael@0: conditions. Since file locking is potentially dangerous at crash time,
michael@0: the convention of generating a UUID and using it as a filename has been
michael@0: adopted.
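
The UUID naming convention can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: ``write_event_file`` and its ``events_dir`` parameter are hypothetical names, and the real producers (Gecko and the crash reporter) are not written in Python.

```python
import os
import uuid


def write_event_file(events_dir: str, data: bytes) -> str:
    """Write one event file under a freshly generated UUID name.

    Hypothetical helper: `events_dir` stands in for the well-defined
    events directory, whose real location is not modeled here.
    """
    # A random UUID makes collisions between concurrent producers
    # extremely unlikely, so no file locking is required.
    path = os.path.join(events_dir, str(uuid.uuid4()))
    # Mode "xb" creates a new file and fails if the name already exists,
    # so an existing event file is never overwritten.
    with open(path, "xb") as f:
        f.write(data)
    return path
```

Each call creates a brand-new file, so two producers writing at the same moment never touch the same path.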
michael@0: 
michael@0: File Format
michael@0: -----------
michael@0: 
michael@0: All crash event files share the same high-level file format. The format
michael@0: consists of the following fields delimited by a UNIX newline (*\n*)
michael@0: character:
michael@0: 
michael@0: * String event name (valid UTF-8, but likely ASCII)
michael@0: * String representation of integer seconds since UNIX epoch
michael@0: * Payload
michael@0: 
michael@0: The payload is event specific and may contain UNIX newline characters.
michael@0: The recommended method for parsing is to split on UNIX newline at most
michael@0: twice, which yields the three fields above, and then dispatch to an
michael@0: event-specific parser based on the event name.
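
A minimal Python sketch of this parsing approach follows. The names ``CrashEvent``, ``parse_event``, ``PAYLOAD_PARSERS`` and ``dispatch`` are assumptions made for illustration and do not come from the Gecko source; the event names are the ones documented in this file.

```python
from typing import Callable, Dict, NamedTuple, Optional


class CrashEvent(NamedTuple):
    name: str
    timestamp: int  # integer seconds since the UNIX epoch
    payload: str


def parse_event(text: str) -> CrashEvent:
    """Split an event file's text into name, timestamp and payload."""
    # maxsplit=2 produces at most three fields, so any newlines inside
    # the payload are preserved.
    fields = text.split("\n", 2)
    if len(fields) != 3:
        raise ValueError("expected name, timestamp and payload fields")
    name, seconds, payload = fields
    return CrashEvent(name, int(seconds), payload)


def parse_crash_id(payload: str) -> dict:
    # Hypothetical payload parser: the payload is the crash ID string.
    return {"crash_id": payload}


# Hypothetical dispatch table keyed by the first line of the file.
PAYLOAD_PARSERS: Dict[str, Callable[[str], dict]] = {
    "crash.main.1": parse_crash_id,
    "crash.plugin.1": parse_crash_id,
    "hang.plugin.1": parse_crash_id,
}


def dispatch(event: CrashEvent) -> Optional[dict]:
    """Run the event-specific parser, or return None for unknown names."""
    parser = PAYLOAD_PARSERS.get(event.name)
    return None if parser is None else parser(event.payload)
```

Keeping the generic split separate from the payload parsers means a new event type only needs a new entry in the dispatch table.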
michael@0: 
michael@0: If an unknown event type is encountered, the event can safely be ignored
michael@0: until later. This helps ensure that application downgrades (potentially
michael@0: due to elevated crash rate) don't result in data loss.
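
One way to honor this rule is sketched below in Python. The function name, the ``KNOWN_EVENTS`` set and the list used as a store are hypothetical, and deleting a file once it has been consolidated is an assumption of this sketch, not something this document specifies.

```python
import os
from typing import List

# Event names this (hypothetical) build knows how to process.
KNOWN_EVENTS = {"crash.main.1", "crash.plugin.1", "hang.plugin.1"}


def process_events_dir(events_dir: str, store: List[str]) -> List[str]:
    """Consolidate known event files; leave unknown ones untouched.

    Returns the paths that were skipped because their event name is
    not recognized.
    """
    skipped = []
    paths = [os.path.join(events_dir, n) for n in os.listdir(events_dir)]
    # Process in order of modification time, oldest first.
    for path in sorted(paths, key=os.path.getmtime):
        with open(path, encoding="utf-8", newline="") as f:
            text = f.read()
        name = text.split("\n", 1)[0]
        if name not in KNOWN_EVENTS:
            # Unknown type: keep the file on disk so that a build which
            # understands it can process it later.
            skipped.append(path)
            continue
        store.append(text)
        # Assumption: a consolidated file is no longer needed.
        os.remove(path)
    return skipped
```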
michael@0: 
michael@0: The format and semantics of each event type are meant to be constant once
michael@0: that event type is committed to the main Firefox repository. If new metadata
michael@0: needs to be captured or the meaning of data captured in an event changes,
michael@0: that change should be expressed through the invention of a new event type.
michael@0: For this reason, it is highly recommended that event names contain a version.
michael@0: For example, instead of a *Gecko process crashed* event, we prefer a *Gecko
michael@0: process crashed v1* event.
michael@0: 
michael@0: Event Types
michael@0: -----------
michael@0: 
michael@0: Each subsection below documents one type of crash event that may be
michael@0: produced. Each subsection name corresponds to the first line of the crash
michael@0: event file.
michael@0: 
michael@0: crash.main.1
michael@0: ^^^^^^^^^^^^
michael@0: 
michael@0: This event is produced when the main process crashes.
michael@0: 
michael@0: The payload of this event is the string crash ID, very likely a UUID.
michael@0: There should be ``UUID.dmp`` and ``UUID.extra`` files on disk, saved by
michael@0: Breakpad.
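
As a small illustration, the two Breakpad file names can be derived from the crash ID as in the Python sketch below. ``breakpad_files`` and ``dumps_dir`` are hypothetical: this document does not say which directory holds the files.

```python
import os


def breakpad_files(crash_id: str, dumps_dir: str) -> dict:
    """Return the expected .dmp/.extra paths and which ones are missing.

    `dumps_dir` is an assumed parameter; the real location of the
    Breakpad files is not described here.
    """
    paths = {
        ext: os.path.join(dumps_dir, crash_id + ext)
        for ext in (".dmp", ".extra")
    }
    # The files *should* exist, but a consumer cannot rely on it.
    missing = [ext for ext, path in paths.items() if not os.path.isfile(path)]
    return {"paths": paths, "missing": missing}
```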
michael@0: 
michael@0: crash.plugin.1
michael@0: ^^^^^^^^^^^^^^
michael@0: 
michael@0: This event is produced when a plugin process crashes.
michael@0: 
michael@0: The payload is identical to ``crash.main.1``'s.
michael@0: 
michael@0: hang.plugin.1
michael@0: ^^^^^^^^^^^^^
michael@0: 
michael@0: This event is produced when a plugin process hangs.
michael@0: 
michael@0: The payload is identical to ``crash.main.1``'s.
michael@0: 
michael@0: Aggregated Event Log
michael@0: ====================
michael@0: 
michael@0: Crash events are aggregated together into a unified event *log*. Currently,
michael@0: this *log* is a single LZ4-compressed JSON file. However, this is an
michael@0: implementation detail and it could change at any time. The interface to crash
michael@0: data provided by the JavaScript API is the only supported interface.
michael@0: 
michael@0: Design Considerations
michael@0: =====================
michael@0: 
michael@0: There are many considerations influencing the design of this subsystem.
michael@0: We attempt to document them in this section.
michael@0: 
michael@0: Decoupling of Event Files from Final Data Structure
michael@0: ---------------------------------------------------
michael@0: 
michael@0: While it is certainly possible for the Gecko process to write directly to
michael@0: the final data structure on disk, there is an intentional decoupling between
michael@0: the production of events and their transition into final storage. In the
michael@0: same vein, the choice to have events written to multiple files by producers
michael@0: is deliberate.
michael@0: 
michael@0: Some recorded events are written immediately after a process crash. This is
michael@0: a very uncertain time for the host system. There is a high likelihood the
michael@0: system is in an exceptional state, such as memory exhaustion. Therefore, any
michael@0: action taken after crashing needs to be very deliberate about what it does.
michael@0: Excessive memory allocation and certain system calls may cause the system
michael@0: to crash again or the machine's condition to worsen. This means that the act
michael@0: of recording a crash event must be very lightweight. Writing a new file from
michael@0: scratch is very lightweight. This is one reason we write separate files.
michael@0: 
michael@0: Another reason we write separate files is that if the main Gecko process
michael@0: itself crashes (as opposed to, say, a plugin process), the crash reporter (not
michael@0: Gecko) is running and the crash reporter needs to handle the writing of the
michael@0: event info. If this writing were involved (say loading, parsing, updating, and
michael@0: reserializing back to disk), this logic would need to be implemented in both
michael@0: Gecko and the crash reporter, or implemented in such a way that both could
michael@0: use it. Neither of these is very practical from a software lifecycle
michael@0: management perspective. It's much easier to have separate processes write a
michael@0: simple file and to let a single implementation do all the complex work.
michael@0: 
michael@0: Idempotent Event Processing
michael@0: ===========================
michael@0: 
michael@0: Processing of event files has been designed so that the result is the same
michael@0: regardless of the order in which those files are processed, and so that
michael@0: processing the same file more than once has no further effect. This is
michael@0: not only a good design decision; it is arguably necessary. Event files are
michael@0: processed in order of file mtime, but filesystem timestamps may not have
michael@0: the resolution required for proper sorting. Processing order is therefore
michael@0: a best-effort assumption, not a guarantee.
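
The property can be illustrated with a small Python sketch. It shows one possible merge rule chosen for illustration, not the rule the real store uses: the store is a plain dict keyed by crash ID, and ``apply_event`` is a hypothetical name.

```python
from typing import Dict, Tuple


def apply_event(store: Dict[str, Tuple[int, str]],
                name: str, timestamp: int, crash_id: str) -> None:
    """Merge one event into `store` without depending on arrival order.

    Assumed merge rule: records are keyed by crash ID and, when two
    events share an ID, the smallest (timestamp, name) pair is kept.
    """
    record = (timestamp, name)
    existing = store.get(crash_id)
    # min() is commutative and idempotent, so replaying an event or
    # reordering events cannot change the final store.
    store[crash_id] = record if existing is None else min(existing, record)
```

Because the merge rule is a minimum, applying the same set of events in any order, with or without repeats, leaves the store in the same state.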
michael@0: 
michael@0: Aggregated Storage Format
michael@0: =========================
michael@0: 
michael@0: Crash events are aggregated into a unified data structure on disk. That data
michael@0: structure is currently LZ4-compressed JSON and is represented by a single file.
michael@0: 
michael@0: The choice of a single JSON file was initially driven by time and complexity
michael@0: concerns. Before changing the format or adding significant amounts of new
michael@0: data, some considerations must be taken into account.
michael@0: 
michael@0: First, in well-behaving installs, crash data should be minimal. Crashes and
michael@0: hangs will be rare and thus the size of the crash data should remain small
michael@0: over time.
michael@0: 
michael@0: The choice of a single JSON file has larger implications as the amount of
michael@0: crash data grows. As new data is accumulated, we need to read and write
michael@0: the entire file to make small updates. LZ4 compression helps reduce I/O,
michael@0: but there is still a potential for unbounded file growth. To bound it, we
michael@0: establish a limit for the maximum age of records; anything older than that
michael@0: limit is pruned. We also establish a daily limit on the number of crashes
michael@0: we will store. All crashes beyond the first N in a day have no payload and
michael@0: are only recorded by incrementing a count. This count ensures we can
michael@0: distinguish between ``N`` and ``100 * N`` crashes in a day, which are very
michael@0: different values!
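
The two limits can be sketched in Python as follows. The per-day layout, the function names and the ``max_per_day`` and ``max_age_days`` parameters are assumptions for illustration only; the real store's schema and limits are not given in this document, and duplicate detection is left out to keep the sketch short.

```python
from typing import Dict

SECONDS_PER_DAY = 86400


def add_crash(store: Dict[int, dict], crash_id: str, timestamp: int,
              max_per_day: int) -> None:
    """Record a crash, keeping details only for the first N of each day."""
    day = timestamp // SECONDS_PER_DAY
    bucket = store.setdefault(day, {"count": 0, "crashes": {}})
    # Every crash is counted, even when its details are dropped.
    bucket["count"] += 1
    if len(bucket["crashes"]) < max_per_day:
        bucket["crashes"][crash_id] = timestamp


def prune(store: Dict[int, dict], now: int, max_age_days: int) -> None:
    """Drop whole days that are older than the maximum record age."""
    cutoff = now // SECONDS_PER_DAY - max_age_days
    for day in [d for d in store if d < cutoff]:
        del store[day]
```

With ``max_per_day`` set to 2, five crashes on the same day leave two detailed records and a count of 5, so the store still distinguishes a quiet day from a very bad one.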