
Experiment Journal

Introduction

This document describes the syntax and grammar of a Chaos Engineering experiment’s output, called a journal. One should be able to walk through the experiment’s run using its journal alone.

This specification should be read alongside the Experiment specification, which it mirrors.

Conventions Used in This Document

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

The terms “JSON”, “JSON text”, “JSON value”, “member”, “element”, “object”, “array”, “number”, “string”, “boolean”, “true”, “false”, and “null” in this document are to be interpreted as defined in RFC 7159.

Other formats

While this specification uses JSON to define its elements, implementations may allow loading from other formats, such as YAML, as long as the output of such a format respects the specification herein.

Journal Elements

Overview

A Journal is one potential output of a Chaos Engineering experiment. The objective of such a journal is to collect all events that took place during the experiment’s run.

The journal contains static information, such as the experiment that was run, as well as runtime entries.

Format

A journal is a JSON object.

Required Properties

A journal MUST declare:

  • an experiment property
  • a status property
  • a start property
  • an end property
  • a duration property
  • a deviated property

The experiment property MUST be a copy of the run experiment as-is and is therefore a JSON object. It MUST follow the Experiment definition.

The status property MUST be a JSON string, one of:

  • "completed": when the experiment runs fully. It does not indicate the activities in the experiment succeeded, only that they were executed as expected
  • "failed": when one of the activities reports a failed condition
  • "aborted": when the experiment breaks for an unforeseen reason
  • "interrupted": when the experiment is interrupted (for instance after a signal is received)

Note

It is important to understand the "completed" status expresses that everything ran all the way. An action may not have resulted in what the operator wanted but it did not fail. Always review the entire journal to fully appreciate the actual outcome of the experiment.

There are two reasons an experiment could be marked as "failed": either a tolerance failed, or an extension checked for a condition that was not met. For instance, suppose the extension made an HTTP call to your service and that call returned a 400 rather than a 200. If the extension was not designed to care about this difference, the status will be marked as "completed". However, if the extension validated the HTTP response, it may have decided to fail the action, which would lead to a "failed" status.

The "aborted" and "interrupted" statuses are different: the former means the experiment crashed somehow (say, because of a bug), while the latter indicates a signal was received. Both MUST stop the entire process.
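The four statuses can be told apart by how a run ends. The following is a minimal sketch of how a runner might classify an outcome. The ActivityFailed exception and the run_experiment callable are hypothetical names introduced for illustration, and treating KeyboardInterrupt as the "signal received" case is an assumption, not something this specification mandates.

```python
class ActivityFailed(Exception):
    """Hypothetical: raised when a tolerance or an extension check fails."""


def classify_run(run_experiment) -> str:
    """Call run_experiment() and map the way it ended to a journal status."""
    try:
        run_experiment()
    except ActivityFailed:
        # A tolerance or an extension reported a failed condition.
        return "failed"
    except KeyboardInterrupt:
        # Python raises KeyboardInterrupt on SIGINT by default.
        return "interrupted"
    except Exception:
        # Any other error is treated as an unforeseen crash.
        return "aborted"
    # Everything ran all the way, whatever the actions achieved.
    return "completed"
```

Note that "completed" is returned even when an action did not achieve what the operator wanted, as long as nothing reported a failure.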

The deviated property MUST be a JSON boolean that MUST be set to true when the steady state hypothesis, executed after the experimental method, reported that at least one of its probes failed to match the expected tolerance.
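As an illustration, the flag can be derived from the outcome of the hypothesis run after the method. This sketch assumes the steady_states.after.steady_state_met layout defined in the Steady State section of this document. Returning false when the after hypothesis was not run is an assumption, since the specification only states when the flag must be true.

```python
def has_deviated(journal: dict) -> bool:
    """Derive the deviated flag from the hypothesis run after the method."""
    after = (journal.get("steady_states") or {}).get("after")
    if after is None:
        # Assumption: no "after" run means no deviation was observed.
        return False
    return not after["steady_state_met"]
```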

The start property MUST be a JSON string formatted as per RFC 3339 in UTC timezone.

The end property MUST be a JSON string formatted as per RFC 3339 in UTC timezone.

The duration property MUST be a JSON number giving the difference between the end datetime and the start datetime.
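The three timing properties can be produced together. The sketch below uses only the Python standard library. The function names are illustrative, and expressing the duration in seconds is an assumption: the specification only says the value is the difference between the two datetimes.

```python
from datetime import datetime, timezone


def to_rfc3339_utc(moment: datetime) -> str:
    """Render a timezone-aware datetime as an RFC 3339 string in UTC."""
    if moment.tzinfo is None:
        raise ValueError("a timezone-aware datetime is required")
    return moment.astimezone(timezone.utc).isoformat()


def timing_properties(start: datetime, end: datetime) -> dict:
    """Build the start, end and duration properties of a journal entry."""
    return {
        "start": to_rfc3339_utc(start),
        "end": to_rfc3339_utc(end),
        # Assumption: the difference is expressed in seconds.
        "duration": (end - start).total_seconds(),
    }
```

The same helper applies to every start, end and duration triple described later in this document.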

In addition to those required properties, the journal SHOULD also declare the following:

  • a steady_states property
  • a run property
  • a rollbacks property

The steady_states property MUST be a JSON object holding the outcome of the steady state hypothesis.

The run property MUST be a JSON array defining the result of each activity in the method element of the experiment.

The rollbacks property MUST be a JSON array defining the result of each action in the rollbacks element of the experiment.
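To make the overall shape concrete, here is a minimal sketch of a journal as a Python dictionary, together with a small checker for the required and recommended properties. All values are placeholders, the experiment content is illustrative only, and check_journal is a hypothetical helper rather than part of any existing tool.

```python
REQUIRED = {
    "experiment": dict,
    "status": str,
    "start": str,
    "end": str,
    "duration": (int, float),
    "deviated": bool,
}
RECOMMENDED = {"steady_states": dict, "run": list, "rollbacks": list}
STATUSES = {"completed", "failed", "aborted", "interrupted"}


def check_journal(journal: dict) -> list:
    """Return a list of problems found; an empty list means none."""
    problems = []
    for name, expected in REQUIRED.items():
        if name not in journal:
            problems.append(f"missing required property: {name}")
        elif not isinstance(journal[name], expected) or (
            # bool is a subclass of int in Python, so reject it explicitly.
            name == "duration" and isinstance(journal[name], bool)
        ):
            problems.append(f"wrong type for property: {name}")
    for name, expected in RECOMMENDED.items():
        if name in journal and not isinstance(journal[name], expected):
            problems.append(f"wrong type for property: {name}")
    status = journal.get("status")
    if isinstance(status, str) and status not in STATUSES:
        problems.append(f"unknown status: {status}")
    return problems


# Placeholder journal: every value below is made up for illustration.
journal = {
    "experiment": {"title": "Placeholder", "description": "n/a", "method": []},
    "status": "completed",
    "start": "2024-01-01T10:00:00+00:00",
    "end": "2024-01-01T10:00:02.500000+00:00",
    "duration": 2.5,
    "deviated": False,
    "steady_states": {},
    "run": [],
    "rollbacks": [],
}
```

Calling check_journal(journal) on the placeholder returns an empty list. The checker does not validate the experiment itself against the Experiment definition.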

Optional Properties

In addition to those required properties, the journal MAY also declare the following:

  • a platform property
  • a node property

The platform property MUST be a JSON string defining the machine on which the experiment was executed. The content is free form but may be similar to the output of the uname -a command.

The node property MUST be a JSON string representing the name of the machine where the experiment was run. The content is free form.
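Both values are free form, so any reasonable source will do. As one possible approach, the sketch below fills them from Python's standard platform module. The host_properties name is illustrative, and the exact strings returned depend on the machine.

```python
import platform


def host_properties() -> dict:
    """Collect the optional platform and node properties of a journal."""
    return {
        # A single string describing the underlying system.
        "platform": platform.platform(),
        # The network name of the machine; may be empty if unknown.
        "node": platform.node(),
    }
```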

Steady State

The steady_states property holds the outcomes of the steady state hypothesis.

The steady_states property MAY declare the following properties:

  • a before property
  • an after property

The before property MUST be a JSON object describing the outcome of the hypothesis run before the method is executed.

The after property MUST be a JSON object describing the outcome of the hypothesis run after the method is executed.

Note that either of those properties MAY be missing if the corresponding hypothesis run did not take place.

Steady State Outcomes

Both the before and after properties follow the same definition and declare:

  • a steady_state_met property
  • a probes property

The steady_state_met property MUST be a JSON boolean. It MUST be true if the steady state hypothesis was met, false otherwise.

A steady state is met when all its probes match their tolerance. A steady state is considered not met as soon as one probe fails to match its tolerance.
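The rule above can be sketched as a loop over the probes. Here each probe is modelled as a hypothetical callable returning a probe result dictionary with a tolerance_met key. Stopping at the first mismatch is one reading of the rule, not a confirmed requirement.

```python
def run_hypothesis(probes) -> dict:
    """Run probes in order and build a steady state outcome."""
    results = []
    met = True
    for probe in probes:
        result = probe()
        results.append(result)
        if not result["tolerance_met"]:
            # Assumption: later probes are skipped after the first mismatch.
            met = False
            break
    return {"steady_state_met": met, "probes": results}
```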

The probes property MUST be a JSON array of probe results.

Each probe result MUST declare the following properties:

  • an activity property
  • a status property
  • a start property
  • an end property
  • a duration property
  • a tolerance_met property
  • an output property

The activity property MUST be a JSON object, a raw copy of the executed probe.

The status property MUST be a JSON string, one of "succeeded" or "failed".

The start property MUST be a JSON string formatted as per RFC 3339 in UTC timezone.

The end property MUST be a JSON string formatted as per RFC 3339 in UTC timezone.

The duration property MUST be a JSON number giving the difference between the end datetime and the start datetime.

The tolerance_met property MUST be a JSON boolean indicating whether the probe matched its tolerance.

The output property MUST be a JSON string or null.

In addition, the probe result MAY contain an additional property:

  • an exception property

This property is set when the probe failed in an unforeseeable way and MUST be a JSON array or JSON string of the error trace.
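As an illustration of how a probe result might be assembled, including the exception property, consider the sketch below. The execute_probe name is hypothetical, the tolerance is reduced to a plain equality check (the real tolerance rules belong to the Experiment specification), and the duration is assumed to be in seconds.

```python
import traceback
from datetime import datetime, timezone


def execute_probe(activity: dict, func, tolerance) -> dict:
    """Run func() and record the outcome as a probe result."""
    start = datetime.now(timezone.utc)
    result = {"activity": activity, "output": None, "tolerance_met": False}
    try:
        value = func()
        result["status"] = "succeeded"
        # The output must be a string or null, so stringify other values.
        result["output"] = None if value is None else str(value)
        # Assumption: tolerance is checked by simple equality here.
        result["tolerance_met"] = value == tolerance
    except Exception as exc:
        result["status"] = "failed"
        # A list of strings, which serializes to a JSON array.
        result["exception"] = traceback.format_exception(
            type(exc), exc, exc.__traceback__
        )
    end = datetime.now(timezone.utc)
    result["start"] = start.isoformat()
    result["end"] = end.isoformat()
    result["duration"] = (end - start).total_seconds()
    return result
```

The exception property only appears when the probe raised, which matches its optional nature.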

Run

The run property holds the outcomes of the method element.

The run property MUST be a JSON array of activity results.

Each activity result MUST declare the following properties:

  • an activity property
  • a status property
  • a start property
  • an end property
  • a duration property
  • an output property

The activity property MUST be a JSON object, a raw copy of the executed probe or action.

The status property MUST be a JSON string, one of "succeeded" or "failed".

The start property MUST be a JSON string formatted as per RFC 3339 in UTC timezone.

The end property MUST be a JSON string formatted as per RFC 3339 in UTC timezone.

The duration property MUST be a JSON number giving the difference between the end datetime and the start datetime.

The output property MUST be a JSON string or null.

In addition, the activity result MAY contain an additional property:

  • an exception property

This property is set when the activity failed in an unforeseeable way and MUST be a JSON array or JSON string of the error trace.
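The rules for an activity result can be checked mechanically. The sketch below is a hypothetical checker, not part of any existing tool. It verifies only property presence, JSON types and the status value, and does not validate the timestamp format.

```python
RESULT_TYPES = {
    "activity": dict,
    "status": str,
    "start": str,
    "end": str,
    "duration": (int, float),
}


def check_activity_result(result: dict) -> list:
    """Return a list of problems found; an empty list means none."""
    problems = []
    for name, expected in RESULT_TYPES.items():
        if name not in result:
            problems.append(f"missing property: {name}")
        elif not isinstance(result[name], expected) or (
            # bool is a subclass of int in Python, so reject it explicitly.
            name == "duration" and isinstance(result[name], bool)
        ):
            problems.append(f"wrong type for property: {name}")
    if "output" not in result:
        problems.append("missing property: output")
    elif result["output"] is not None and not isinstance(result["output"], str):
        problems.append("wrong type for property: output")
    status = result.get("status")
    if isinstance(status, str) and status not in ("succeeded", "failed"):
        problems.append(f"unknown status: {status}")
    if "exception" in result and not isinstance(result["exception"], (list, str)):
        problems.append("wrong type for property: exception")
    return problems
```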

Rollbacks

The rollbacks property holds the outcomes of the rollbacks element.

The rollbacks property MUST be a JSON array of action results.

Each action result MUST declare the following properties:

  • an activity property
  • a status property
  • a start property
  • an end property
  • a duration property
  • an output property

The activity property MUST be a JSON object, a raw copy of the executed action.

The status property MUST be a JSON string, one of "succeeded" or "failed".

The start property MUST be a JSON string formatted as per RFC 3339 in UTC timezone.

The end property MUST be a JSON string formatted as per RFC 3339 in UTC timezone.

The duration property MUST be a JSON number giving the difference between the end datetime and the start datetime.

The output property MUST be a JSON string or null.

In addition, the action result MAY contain an additional property:

  • an exception property

This property is set when the action failed in an unforeseeable way and MUST be a JSON array or JSON string of the error trace.

Rollbacks MUST NOT be applied when the experiment status is "interrupted".
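A runner can enforce this rule with a simple guard. In the sketch below, apply_rollbacks and the execute callable are hypothetical names. Returning an empty list when the rollbacks are skipped is an assumption about how the rollbacks property would then be filled.

```python
def apply_rollbacks(status: str, rollbacks: list, execute) -> list:
    """Run each rollback action unless the experiment was interrupted."""
    if status == "interrupted":
        # Rollbacks are skipped entirely after an interruption.
        return []
    return [execute(action) for action in rollbacks]
```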