システム障害対応に困っているあなたに ちょっと役立つメディア

Welcome to CO-TROU-SH!

Welcome to CO-TROU-SH!

CO-TROU-SH is a media that was established for the principle of helping those who are struggling with system failure response. We will explain this media overview, focusing on what cooperation is about.


Table of Contents

  • Self-Introduction of the Founder
  • What is the System Failure Response that Makes a Huge Impact Through Cooperation?
  • Plan for Approaching the Challenges in System Failure Response
  • Mission & Vision
  • What is Co-trou-sh (Kotrash) about?

Self-Introduction of the Founder

KOJI NOMURA has been responsible for the development, maintenance, and improvement of financial systems for 12 years. For six years, he lived a life of receiving alert calls six times a day and rushing to the site in the middle of the night twice a week. Through this experience, he developed a strong desire to improve this situation and spent the next six years focusing on improving maintenance and operations, especially in system failure response.

He analyzed approximately 1,000 cases of system failures. In improving system failure responses, he achieved significant results, such as reducing alerts by 90%. Currently, he is in charge of planning and operating the incident response service ‘XonOps’, aimed at minimizing the impact on end-users. He is engaged in improving system failure responses for over 100 internal and external teams.

What is the System Failure Response that Makes a Huge Impact Through Cooperation?

“Cooperation” refers to development teams and user companies working together to address system failures.

Of course, we understand that for some, this may seem impossible, or you might feel that your customers would not engage in such a way. Despite these circumstances, we still believe in making cooperation and mutual assistance the standard in system failure response.

We believe that system failure response can minimize the inconveniences for end-users by collaborating and helping each other among various stakeholders. When IT services are built, they are developed by many parties, both inside and outside the company, and they often stand as a single product in cooperation with other IT services.

Therefore, if a failure occurs in an IT service built in this way, it seems natural and logical to overcome it with many stakeholders, just as it was during the construction phase.

Plan for Approaching the Challenges in System Failure Response

3-1. Lack of Progress in Improvement

Addressing the challenge of not knowing how to take the first step in improving system failure response, we acknowledge that there is a scarcity of know-how in this area. Therefore, we plan to create a space for sharing knowledge by building books, media, and communities. In addition to introducing and explaining guidelines from ITIL and the Ministry of Internal Affairs and Communications, our goal is to eventually create a forum where everyone can discuss and consult on issues related to system failure response.

3-2. Numerous/Complex Alerts

We will address issues for those struggling with an overwhelming number of alerts, such as being unable to sleep due to constant phone calls, non-stop flashing patrol lamps, or receiving dozens of emails a day. We also aim to tackle challenges where the complexity of alert content leads to over-reliance on veteran employees. Our goal is to carry out improvement activities to assist those troubled by these alert-related issues.

3-3. Person-Dependent Failure Response

For the challenge of system failure responses being person-dependent, such as “unable to gather people or manage tasks” or “failure responses being too reliant on specific individuals,” we have systematized methods of organization. Moving forward, we plan to practice these methods with everyone through workshops and study group-like formats. We also intend to introduce and develop IT services that support these efforts.

Mission & Vision

We would like to explain our mission and vision:

Mission: To create a world where people help each other in times of need.

Vision: To create a world where end-users are not troubled by system incidents.

During my bachelor’s and master’s, I dedicated myself to studying earthquakes. In 2011, just before my graduation, the Great East Japan Earthquake occurred and I was devastated by the tragic scenery that the whole part of the towns were completely swept away by the tsunami. However, I was also deeply moved by the volunteers from all over Japan and the world as well who came over to help, sacrificing all they had.

Afterwards, when I joined a major system integrator and faced a major system failure, a lot of people gave me a hand. This gave me an insight that if Japan and the world could cooperate as they do during natural disasters, the tragic impact of system incidents on end-users could be reduced. While quality control and other measures to prevent failures have been well-researched, there is still much room for improvement after system failures occur. I believe that helping each other in emergencies can make things even better.

What is Co-Trou-Sh?

     Co-Trou-sh is the abbreviation of the word Co-Troubleshooting, which means tackling system failure responses through cooperation.

    This media is targeted at those who are struggling with system failure responses, offering solutions to challenges such as ‘lack of improvement progress,’ ‘numerous/complex alerts,’ and ‘person-dependent failure responses.’ It provides ‘methods of improvement, ‘response know-how,’ and ‘case studies.’

「3カ月で改善!システム障害対応 実践ガイド」著者

野村浩司:金融システムの開発保守運用と改善を12年担当。7年にわたり合計約 1000 件の障害事例を分析。システム障害対応の改善では、アラートを9割削減。

  • nomura野村浩司