JMWE Scheduler Job Service Disruption

Incident Report for JMWE Cloud

Postmortem

2025-01-28 Incident: Post-mortem

Summary

At approximately 00:30 UTC on Jan 28, 2025, the JMWE database began restarting intermittently, resulting in the following issues:

Web items, such as configuration pages and post-function details in workflows, could not be reliably displayed.
Workflow post-functions were not processed as expected.
Installation events may not have been acknowledged.
Note: Conditions and validators were unaffected.

Our team promptly investigated the issue and deployed a fix that restored functionality to web items and workflow post-functions. However, a related issue created a significant backlog in our processing queue, leading to degraded performance for Scheduled Actions until approximately 09:00 UTC.

As of the time of this report, the issue has been fully resolved. The development team continues to closely monitor the platform’s health, and we will provide further updates if necessary.

Root cause analysis

The disruption was caused by changes to Atlassian’s “Search Issues” API, which JMWE uses to determine targets for Scheduled Actions. Atlassian deprecated the existing version of this API and introduced a new version with different pagination behavior.

This difference triggered an unexpected condition in our code, allowing some actions to process an unbounded number of issues. Consequently, a few misconfigured actions—previously limited by checks in our code—generated more events than our system could handle within its scaling capacity.
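To illustrate the failure mode, the following is a minimal sketch of a paginated search loop with a hard cap on the number of issues it returns. It is not JMWE's actual code: `fetch_page`, `collect_issue_keys`, `make_fake_fetcher`, and the token-based page format are all hypothetical names and assumptions chosen for the example. The point is that the loop's upper bound is enforced locally and does not depend on how the API signals the last page.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical page fetcher: takes a page token (None for the first page) and
# returns (issue_keys, next_token). next_token is None on the last page.
PageFetcher = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def collect_issue_keys(fetch_page: PageFetcher, max_issues: int) -> List[str]:
    """Collect issue keys page by page, never returning more than max_issues."""
    collected: List[str] = []
    token: Optional[str] = None
    seen_tokens = set()
    while len(collected) < max_issues:
        issues, token = fetch_page(token)
        remaining = max_issues - len(collected)
        collected.extend(issues[:remaining])
        # Stop on the last page, on an empty page, or if a token repeats,
        # so a change in pagination behavior cannot make the loop unbounded.
        if token is None or not issues or token in seen_tokens:
            break
        seen_tokens.add(token)
    return collected


def make_fake_fetcher(total: int, page_size: int) -> PageFetcher:
    """Build an in-memory fetcher for demonstration (assumption, not a real API)."""
    keys = [f"DEMO-{i}" for i in range(1, total + 1)]

    def fetch(token: Optional[str]) -> Tuple[List[str], Optional[str]]:
        start = int(token) if token is not None else 0
        next_start = start + page_size
        next_token = str(next_start) if next_start < total else None
        return keys[start:next_start], next_token

    return fetch


capped = collect_issue_keys(make_fake_fetcher(total=250, page_size=100), max_issues=120)
```

In this sketch, a search matching 250 issues is cut off at the configured cap of 120, whatever the page size or termination signal.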

Potential impact

Post-functions in workflows, Event-based Actions, or Scheduled Actions might have worked only intermittently during the night between Jan 27, 2025 and Jan 28, 2025.
Scheduled Actions due before approximately 09:00 UTC on Jan 28, 2025 might not have been executed, or might have been executed later than originally scheduled.
- Our scheduler tries to reprocess any action that was not handled during its previously scheduled runs. For example, if your action is meant to run at 8:00 AM and it fails, it will be retried at 9:00 AM and on subsequent schedules.
App installation events might not have been processed. If your instance displays messages such as “Could not load base context”, you might have to reinstall JMWE.
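The retry behavior described for Scheduled Actions above can be pictured with a small sketch. It assumes a scheduler that checks each action on every tick and carries a failed or missed run over to the next tick. The `ScheduledAction` class, the `pending` flag, and the `tick` function are hypothetical and are not taken from JMWE's implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScheduledAction:
    name: str
    pending: bool = False  # True when an earlier run failed or was missed


def tick(action: ScheduledAction, due_now: bool,
         run: Callable[[ScheduledAction], None]) -> bool:
    """Run the action if it is due now or left over from an earlier failure.

    Returns True only when the action ran successfully on this tick.
    """
    if not (due_now or action.pending):
        return False
    try:
        run(action)
    except Exception:
        # Remember the failure so the next tick retries the action.
        action.pending = True
        return False
    action.pending = False
    return True


def failing_run(action: ScheduledAction) -> None:
    raise RuntimeError("queue unavailable")


def working_run(action: ScheduledAction) -> None:
    pass


action = ScheduledAction("demo-action")
ran_at_8 = tick(action, due_now=True, run=failing_run)   # 8:00 AM run fails
ran_at_9 = tick(action, due_now=False, run=working_run)  # 9:00 AM tick retries
```

Under these assumptions, the 8:00 AM failure leaves the action pending, and the 9:00 AM tick runs it even though it was not newly due.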

Next steps

JMWE has been updated to handle significantly larger volumes of issues while ensuring graceful degradation during unexpected spikes.
We are revising our internal infrastructure to process up to 10 times the current peak loads.
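As a rough illustration of what graceful degradation during a spike can look like, the sketch below uses a bounded in-memory queue that defers events it cannot accept instead of growing without limit. This is one possible technique under stated assumptions, not a description of the changes made to JMWE. The `enqueue_events` function and the queue size are hypothetical.

```python
import queue
from typing import Iterable, List


def enqueue_events(work_queue: "queue.Queue[int]", events: Iterable[int]) -> List[int]:
    """Enqueue as many events as fit; return the ones deferred for a later attempt."""
    deferred: List[int] = []
    for event in events:
        try:
            work_queue.put_nowait(event)
        except queue.Full:
            # The queue is at capacity: defer the event rather than block or crash.
            deferred.append(event)
    return deferred


work_queue: "queue.Queue[int]" = queue.Queue(maxsize=3)
deferred = enqueue_events(work_queue, range(5))
```

Here a burst of five events fills the three available slots, and the remaining two are handed back to the caller to be retried once the backlog has drained.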

We deeply apologize for the inconvenience this disruption may have caused. Ensuring the reliability of JMWE is our highest priority, and we are committed to learning from this incident to provide a more robust and scalable platform.

If you have any questions or concerns, please don’t hesitate to contact our support team.

Posted Jan 28, 2025 - 10:43 EST

Resolved

Our team identified the root cause as unexpected data impacting the database. This has been addressed, and we’re continuing to monitor the app’s health to ensure everything remains stable.

Thank you for your patience.

Posted Jan 28, 2025 - 04:30 EST

Investigating

We are currently experiencing an issue with the JMWE Scheduler job service in the backend, which is unable to process messages. As a result, configured Scheduled Actions are not functioning as expected. Our team is actively investigating the root cause of the issue. Please note that all other JMWE functionalities remain fully operational. Thank you for your patience and understanding.

Posted Jan 27, 2025 - 19:30 EST

This incident affected: JMWE for JIRA Cloud.