toolkit/components/crashes/docs/crash-events.rst

Thu, 22 Jan 2015 13:21:57 +0100

author
Michael Schloh von Bennewitz <michael@schloh.com>
date
Thu, 22 Jan 2015 13:21:57 +0100
branch
TOR_BUG_9701
changeset 15
b8a032363ba2
permissions
-rw-r--r--

Incorporate requested changes from Mozilla in review:
https://bugzilla.mozilla.org/show_bug.cgi?id=1123480#c6


============
Crash Events
============

**Crash Events** refers to a special subsystem of Gecko that aims to capture
events of interest related to process crashing and hanging.

When an event worthy of recording occurs, a file containing that event's
information is written to a well-defined location on the filesystem. The Gecko
process periodically scans for produced files and consolidates information
into a more unified and efficient backend store.

Crash Event Files
=================

When a crash-related event occurs, a file describing that event is written
to a well-defined directory. That directory is normally inside the directory
of the currently active profile. However, if a profile is not yet active in
the Gecko process, that directory likely resides in the user's *app data*
directory (*UAppData* from the directory service).

The filename of the event file carries no meaning. However, producers need
to choose a filename carefully to avoid name collisions and race
conditions. Since file locking is potentially dangerous at crash time,
the convention of generating a UUID and using it as the filename has been
adopted.
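
The UUID filename convention can be sketched as follows. This is a minimal
illustration in Python, not the Gecko implementation; the function name
``write_event_file`` and its parameters are hypothetical, and the three-field
layout it writes is the one described under *File Format*.

```python
import os
import time
import uuid


def write_event_file(events_dir, event_name, payload, now=None):
    """Write one crash event to a new, uniquely named file; return its path.

    Hypothetical helper for illustration only.
    """
    timestamp = int(time.time() if now is None else now)
    # A random UUID makes a collision between concurrent producers
    # vanishingly unlikely, so no file lock is needed.
    path = os.path.join(events_dir, str(uuid.uuid4()))
    # Mode "x" fails instead of overwriting if the file already exists.
    with open(path, "x", encoding="utf-8", newline="\n") as handle:
        handle.write(f"{event_name}\n{timestamp}\n{payload}")
    return path
```

Because every producer writes to its own fresh file, two processes recording
events at the same moment never touch the same path.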

File Format
-----------

All crash event files share the same high-level file format. The format
consists of the following fields delimited by a UNIX newline (``\n``)
character:

* String event name (valid UTF-8, but likely ASCII)
* String representation of integer seconds since UNIX epoch
* Payload

The payload is event specific and may contain UNIX newline characters.
The recommended method for parsing is to split only on the first two UNIX
newlines, which yields the three fields above, and then dispatch to an
event-specific parser based on the event name.
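
A minimal parsing sketch in Python, assuming the file has already been read
into memory as bytes. The function name ``parse_event_file`` is hypothetical.
Splitting with ``maxsplit=2`` cuts only at the first two newlines, so any
newlines inside the payload survive intact.

```python
def parse_event_file(data):
    """Split raw event file bytes into (name, timestamp, payload).

    Hypothetical helper for illustration only.
    """
    text = data.decode("utf-8")
    # maxsplit=2 produces at most three fields; the payload keeps
    # any newline characters it contains.
    parts = text.split("\n", 2)
    if len(parts) != 3:
        raise ValueError("expected event name, timestamp, and payload")
    name, seconds, payload = parts
    return name, int(seconds), payload
```

A file with fewer than three fields, or with a timestamp that is not an
integer, raises ``ValueError`` rather than producing a partial record.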

If an unknown event type is encountered, the event can safely be ignored
until later. This helps ensure that application downgrades (potentially
due to elevated crash rate) don't result in data loss.
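
One way to leave unknown event types untouched is a lookup table of parsers
keyed by event name. The sketch below is an assumption about how such a
dispatcher could look, not the actual Gecko code; ``PARSERS``,
``parse_crash_id``, and ``dispatch_event`` are hypothetical names.

```python
def parse_crash_id(payload):
    """Parser for events whose payload is just a crash ID string."""
    return payload.strip()


# Hypothetical registry mapping event names to payload parsers.
PARSERS = {
    "crash.main.1": parse_crash_id,
    "crash.plugin.1": parse_crash_id,
    "hang.plugin.1": parse_crash_id,
}


def dispatch_event(name, payload):
    """Return (handled, result); unknown event types are left unhandled."""
    parser = PARSERS.get(name)
    if parser is None:
        # The caller should keep the file on disk: a newer build
        # that knows this event type can still process it later.
        return False, None
    return True, parser(payload)
```

A caller would delete an event file only when ``handled`` is true, so files
written by a newer version survive a downgrade.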

The format and semantics of each event type are meant to be constant once
that event type is committed to the main Firefox repository. If new metadata
needs to be captured or the meaning of data captured in an event changes,
that change should be expressed through the invention of a new event type.
For this reason, event names should contain a version. For example, instead
of a *Gecko process crashed* event, we prefer a *Gecko process crashed v1*
event.

Event Types
-----------

The subsections below document the types of crash events that may be
produced. Each subsection name corresponds to the first line of the crash
event file.

crash.main.1
^^^^^^^^^^^^

This event is produced when the main process crashes.

The payload of this event is the string crash ID, very likely a UUID.
There should be ``UUID.dmp`` and ``UUID.extra`` files on disk, saved by
Breakpad.
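
Given a crash ID, the two companion filenames follow directly from the
naming pattern above. The sketch below only builds the expected paths; the
directory in which Breakpad saves these files is not specified here, so
``dump_dir`` is a hypothetical parameter supplied by the caller.

```python
import os


def minidump_paths(dump_dir, crash_id):
    """Return the expected (.dmp, .extra) paths for a crash ID.

    `dump_dir` is an assumption: the document does not say where
    Breakpad stores these files.
    """
    return (
        os.path.join(dump_dir, crash_id + ".dmp"),
        os.path.join(dump_dir, crash_id + ".extra"),
    )
```

A consumer could check both paths with ``os.path.exists`` before treating
the crash record as complete.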

crash.plugin.1
^^^^^^^^^^^^^^

This event is produced when a plugin process crashes.

The payload is identical to ``crash.main.1``'s.

hang.plugin.1
^^^^^^^^^^^^^

This event is produced when a plugin process hangs.

The payload is identical to ``crash.main.1``'s.

Aggregated Event Log
====================

Crash events are aggregated together into a unified event *log*. Currently,
this *log* is really a JSON file. However, this is an implementation detail
and it could change at any time. The interface to crash data provided by
the JavaScript API is the only supported interface.

Design Considerations
=====================

There are many considerations influencing the design of this subsystem.
We attempt to document them in this section.

Decoupling of Event Files from Final Data Structure
---------------------------------------------------

While it is certainly possible for the Gecko process to write directly to
the final data structure on disk, there is an intentional decoupling between
the production of events and their transition into final storage. In the
same vein, the choice to have events written to multiple files by producers
is deliberate.

Some recorded events are written immediately after a process crash. This is
a very uncertain time for the host system. There is a high likelihood the
system is in an exceptional state, such as memory exhaustion. Therefore, any
action taken after crashing needs to be very deliberate about what it does.
Excessive memory allocation and certain system calls may cause the system
to crash again or the machine's condition to worsen. This means that the act
of recording a crash event must be very lightweight. Writing a new file from
nothing is very lightweight. This is one reason we write separate files.

Another reason we write separate files is that if the main Gecko process
itself crashes (as opposed to, say, a plugin process), the crash reporter (not
Gecko) is running and must handle the writing of the event info. If this
writing were involved (say, loading, parsing, updating, and reserializing
back to disk), the logic would need to be implemented in both Gecko and the
crash reporter, or implemented in such a way that both could use it. Neither
of these is very practical from a software lifecycle management perspective.
It's much easier to have separate processes write a simple file and to let a
single implementation do all the complex work.

Idempotent Event Processing
===========================

Processing of event files has been designed such that the result is
idempotent regardless of the order in which those files are processed. This
is not only a good design decision, but it is arguably necessary. While event
files are processed in order by file mtime, filesystem times may not have
the resolution required for proper sorting. Therefore, processing order is
merely an optimistic assumption.
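
One way to get order-independent, repeatable results is to key each record
by its crash ID, so that applying the same event again overwrites the record
with identical data. The sketch below assumes every event carries a unique
crash ID as its payload; the ``apply_event`` function and the dictionary
layout are hypothetical and do not describe the real store.

```python
def apply_event(store, name, timestamp, crash_id):
    """Record one event in `store`, a dict keyed by crash ID.

    Hypothetical layout for illustration only.
    """
    # Assigning by key (instead of appending to a list) means that
    # replaying an event, or applying events in another order,
    # leaves the store in the same final state.
    store[crash_id] = {"type": name, "time": timestamp}
    return store
```

Two events with the same mtime can therefore be processed in either order,
and a file that is processed twice does not produce a duplicate record.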

Aggregated Storage Format
=========================

Crash events are aggregated into a unified data structure on disk. That data
structure is currently LZ4-compressed JSON and is represented by a single file.

The choice of a single JSON file was initially driven by time and complexity
concerns. Before changing the format or adding significant amounts of new
data, some considerations must be taken into account.

First, in well-behaving installs, crash data should be minimal. Crashes and
hangs will be rare and thus the size of the crash data should remain small
over time.

The choice of a single JSON file has larger implications as the amount of
crash data grows. As new data is accumulated, we need to read and write
an entire file to make small updates. LZ4 compression helps reduce I/O.
However, there is still a potential for unbounded file growth. To bound it,
we establish a limit for the max age of records. Anything older than that
limit is pruned. We also establish a daily limit on the number of crashes we
will store. All crashes beyond the first N in a day have no payload and are
only recorded by the presence of a count. This count ensures we can
distinguish between ``N`` and ``100 * N``, which are very different
values!
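
The two limits can be sketched as a small in-memory store. This is an
illustration under stated assumptions, not the real implementation: the
class name ``CrashStore``, its fields, and the limit values passed by the
caller are all hypothetical, and days are approximated as fixed 24-hour
buckets counted from the UNIX epoch.

```python
from collections import defaultdict

SECONDS_PER_DAY = 24 * 60 * 60


class CrashStore:
    """Hypothetical store with a max record age and a daily payload limit."""

    def __init__(self, max_age_days, daily_limit):
        self.max_age = max_age_days * SECONDS_PER_DAY
        self.daily_limit = daily_limit
        self.records = {}                     # crash ID -> timestamp
        self.daily_counts = defaultdict(int)  # day index -> crashes seen

    def add(self, crash_id, timestamp):
        day = timestamp // SECONDS_PER_DAY
        # Every crash is counted, even past the daily limit.
        self.daily_counts[day] += 1
        # Only the first `daily_limit` crashes of a day keep a payload.
        if self.daily_counts[day] <= self.daily_limit:
            self.records[crash_id] = timestamp

    def prune(self, now):
        """Drop records and day counts older than the max age."""
        cutoff = now - self.max_age
        self.records = {
            cid: ts for cid, ts in self.records.items() if ts >= cutoff
        }
        cutoff_day = cutoff // SECONDS_PER_DAY
        for day in [d for d in self.daily_counts if d < cutoff_day]:
            del self.daily_counts[day]
```

With a daily limit of 2, five crashes on one day leave two stored records
and a count of 5 for that day, so a bad day remains visible without storing
every payload.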
