Tuesday, March 17, 2020

Troubleshooting Methodology and Soft Skills

Four Fundamental Questions for effective Problem Analysis


THESE QUESTIONS SHOULD BE ASKED FOR EVERY SERVICE REQUEST WHERE A PROBLEM IS REPORTED (~75% of SRs)
  1. What evidence do we have?
  2. When did the problem start?
  3. Do you have any other systems that could have this problem but do not?
  4. How many objects have the problem vs. how many do not?
Caution! These are Problem Analysis questions….so, what is a Problem? How do we know we are working on a Problem?

A Problem is a deviated condition resulting in the unexpected behavior of a System.

Problem Lifecycle



Situation Appraisal
First, we need to understand the situation. To do that, we can ask the questions below:
  • What is this case about?
  • Try to explain the issue in your native language
  • Ask “obvious” questions
  • (Use “We” vs. “You”)
  • Understand the concern
Problem Analysis
The next step is to analyze the problem. To do that, we need to ask the questions below:
  • Is this a problem?
       Problem: deviation from expected behavior
  • How is the problem determined or measured (evidence)
       Is this measurement reliable? (Trust but Verify)
  • When did the problem start?
  • Is there a similar device that does not have the problem?
  • How widespread is the problem
       How many devices see the problem vs. how many do not

Problem Analysis – Evidence Reliability
Actual problem or unknown expected behavior?
Do not rely on second-hand information (socialization) unless the information comes from a known Subject Matter Expert or is supported by hard evidence (EDCS, RFC, system outputs, etc.).
§  Work on facts not opinion;
§  Be willing to fact check
§  Humans are error prone
§  Assumptions: “This has always worked”
    Unintentionally misleading

Problem Analysis – Evidence Accuracy
§  How are the symptoms observed?
§  How accurate or precise is the measurement?
§  How is impact measured?
-        User reports vs. Packet captures
§  What is the customer’s expected behavior?
-        Is this the same as the designed intent?
Problem Recursion Example #1
“5 Whys” for Effective Causal Analysis; Be Curious!


Problem Recursion Example #2 

What evidence do we have?...that tells us there is a problem?
Two major reasons why evidence is important:
  1. As a troubleshooter, we want to work with FACTS, not OPINION.
  2. As a troubleshooter, we need to understand if we truly have a DEVIATION or if our EVIDENCE describes the expected behavior of a system.
Use your senses…what do we see, hear, feel, or taste, that tells us FIRSTHAND there is a deviation?
        Do not rely on human socialization for evidence gathering…
        …unless it is corroborated by multiple reliable sources.
        …unless it comes from someone you believe is a Subject Matter Expert, based on a relationship you have or prior experience.
        …unless it is supported by system output, RFCs, EDCS, user manuals, or some  non-human evidence-based method.
Humans are imperfect…… (Yes, it’s true). Sometimes, as humans….
        …we make assumptions.
        …we present “facts” against irrelevant contexts, sometimes emotional contexts.
        …we present what we believe are “facts”, but are instead blatantly false information. i.e. Sometimes we are wrong (Yes, it’s true).
        …we mislead unintentionally,
        …we unintentionally mix opinions in items presented outwardly as “facts”.
        “Mr. Customer, what is your expectation of how this should work?”
        Does the customer’s expectation of desirable system behavior match the system’s design-intent? Be skeptical.
Access the customer’s system….. to diagnose, isolate, and frame the context of a problem.
        Use Web-Ex (asynch, requires customer resources)
        Even better, server-to-server SSH, telnet, VPN, what does customer have available? Always ask about synchronous access availability (you’ll be surprised).
        See the data firsthand! “Are there any objects that work properly? May I access the working object?”
        See the data firsthand! (again)
        Don’t use PING-PONG method if you can help it.
        Never present anything except pure FACTS supported by evidence to internal SMEs.

When did the problem start?
Why is understanding problem start time important?
  1. Has it ever worked?
a)      If so, we are necessarily looking for a change.
b)      If not, what is customer expectation of how it should work and is this expectation realistic? Is there a compelling reason that it SHOULD work?
c)      When did the problem start vs. when the system was 1st deployed. Are we dealing with a Day 1 or Day 2 problem? Each requires different analysis….
  2. Historical Log analysis or adjacent device analysis.
a)      Even if you know absolutely nothing about a platform, protocol, or system, examining log files against the backdrop of problem start time can provide an outstanding learning experience and usually yield a decent Working Theory.
b)      Temporal comparative analysis. What was happening in, on, or around the system before the problem started vs. after the Trigger event?
  3. Trigger Theories are always based on time!
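The temporal comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log lines are made-up sample data, the "YYYY-MM-DD HH:MM:SS message" layout is assumed, and the helper names (parse_line, split_around) are hypothetical rather than part of any real tool.

```python
from datetime import datetime, timedelta

# Made-up sample log; the timestamp layout is an assumption for this sketch.
SAMPLE_LOG = """\
2020-03-16 09:58:12 interface eth1 link up
2020-03-16 10:02:41 configuration changed by admin
2020-03-16 10:04:03 routing neighbor 10.0.0.2 went down
2020-03-16 10:30:00 system clock updated
"""

def parse_line(line):
    """Split one log line into (timestamp, message)."""
    stamp, message = line[:19], line[20:]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message

def split_around(log_text, problem_start, window):
    """Return the events inside `window` before and after the problem start."""
    before, after = [], []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        when, message = parse_line(line)
        if problem_start - window <= when < problem_start:
            before.append((when, message))
        elif problem_start <= when <= problem_start + window:
            after.append((when, message))
    return before, after

# Problem start time as reported by the customer (hypothetical value).
problem_start = datetime(2020, 3, 16, 10, 4, 0)
before, after = split_around(SAMPLE_LOG, problem_start, timedelta(minutes=10))

print("Shortly BEFORE the problem started (candidate triggers):")
for when, message in before:
    print(f"  {when}  {message}")
print("AFTER the problem started (likely symptoms):")
for when, message in after:
    print(f"  {when}  {message}")
```

Events that land just before the start time, such as a configuration change, are the natural candidates for a Trigger Theory; events after it are more likely symptoms.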
Do you have any other systems that could have this problem but do not?
Why is this important?
        Without the requisite knowledge and experience, this may give us a compelling reason to believe that the system SHOULD work differently compared to what is reported.
        Allows for comparative analysis; what distinctions are there between the deviating object on the failing system and the working system?  Configuration, platform/hardware versions, software versions, external variables such as connected devices, etc.
        Comparative analysis is useful even when you lack the knowledge/experience related to a technology.  Note ANY distinctions; without K&E some differences may not be important to you, but they may be important to someone else!
        In many investigations, understanding why a problem is not happening on a system will uncover why it is happening on the failing system.
        Define deviating “object”; may not always be a platform, or specific feature.  It might be a session, output printed from a show command, MIB object value, VoIP call, etc.
        Once we discover what object to focus on, we can readily assess where else we might observe the object on other systems to determine if others are behaving differently
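As a rough sketch of the comparative analysis above, the snippet below lists every attribute that differs between a failing object and a working one. The attribute names and values are invented for illustration, and the function name (distinctions) is hypothetical; in practice the two dictionaries would be filled from whatever data was collected firsthand.

```python
def distinctions(failing, working):
    """Return {attribute: (failing_value, working_value)} for every difference."""
    diffs = {}
    for key in sorted(set(failing) | set(working)):
        f_val = failing.get(key, "<absent>")
        w_val = working.get(key, "<absent>")
        if f_val != w_val:
            diffs[key] = (f_val, w_val)
    return diffs

# Hypothetical attributes collected from a failing and a working system.
failing = {"platform": "model-A", "software": "1.2.3", "mtu": 1500, "peer": "vendor-X"}
working = {"platform": "model-A", "software": "1.2.1", "mtu": 1500}

for attr, (f_val, w_val) in distinctions(failing, working).items():
    print(f"{attr}: failing={f_val} working={w_val}")
```

The sketch deliberately reports ANY distinction, including attributes present on only one side, since a difference that looks unimportant to you may matter to someone with more knowledge and experience.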
How many systems do have the problem vs. how many do not?
Why is this important?
        Immediate determinant of business impact [You just bought 50 of these and the feature you bought them for is failing on all of them?]
        Can use this data to judge the potential for scaled/trended comparative analysis between failing system(s) and working system(s) -  [Only 1 is broken, but you have 20 others that work fine?  Let’s check those out…]
        Allows you to judge how feasible it may be to reproduce in-house, based on how common the failure is in a production environment (How elusive might the trigger condition(s) be?)
        May allow for workaround opportunities – can we use or divert traffic to working systems while we investigate the failure?
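One way to picture the "how many fail vs. how many do not" question is a simple tally per attribute. The sketch below is illustrative only: the device inventory is invented, and the field names (software, has_problem) and the tally helper are assumptions made for this example.

```python
from collections import Counter

# Invented inventory; in practice this comes from the customer's environment.
devices = [
    {"name": "sw01", "software": "1.2.3", "has_problem": True},
    {"name": "sw02", "software": "1.2.3", "has_problem": True},
    {"name": "sw03", "software": "1.2.1", "has_problem": False},
    {"name": "sw04", "software": "1.2.1", "has_problem": False},
    {"name": "sw05", "software": "1.2.1", "has_problem": False},
]

def tally(devices, attribute):
    """Count failing vs. working devices for each value of `attribute`."""
    counts = {}
    for device in devices:
        bucket = counts.setdefault(device[attribute], Counter())
        bucket["failing" if device["has_problem"] else "working"] += 1
    return counts

failing_total = sum(1 for d in devices if d["has_problem"])
print(f"{failing_total} of {len(devices)} devices have the problem")
for value, bucket in tally(devices, "software").items():
    print(f"software {value}: failing={bucket['failing']} working={bucket['working']}")
```

If every failure clusters under one attribute value, that points to a distinction worth investigating and suggests which working systems could carry traffic as a workaround. It is a lead, not proof of cause.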

Topic and Knowledge-base
        Topic problem #1: Non-reputation based (there is no authenticity check for data; it could be untrue).
        Topic problem #2: Dated information, what was true in 2008 (or last month) may not be true today!
        Example: Feature X wasn’t supported in January 2012. BU introduces support for Feature X in February 2012. A $2M USD deal was sold based solely on Feature X in March 2012. Customer opens SR for help configuring Feature X in April 2012. http://topic contains multiple posts stating that Feature X isn’t supported, all dated 2011. What would happen if you repeated what you read in Topic to this customer? The Customer, Account Team, and BU would be rightfully angry.

  VERY IMPORTANT NOTE: Do not be the person that blindly repeats out-of-context information discovered in Topic. Verify your situation and its context.
        Know when to let it go; what you’re seeking might not be there.  Don’t go down a rat hole hoping you’ll find a miracle. If finding a “solution” in Topic is your only plan, you may need to regroup.
        Be careful about context; many variables influence a stance from engineering on a technical issue.  Be *certain* information you uncover via Topic or other tools applies to your situation, whether you are reading newsgroup info or determining how applicable a DDTS may be to your investigation.
        If there are distinctions, MAKE NO ASSUMPTIONS.  Follow-up to get a current answer that applies to your situation as needed.

Account Teams (Enlist Them in your Army)
               The Account Team is not your enemy; they are an asset during a problem investigation
        TAC CSEs can work with the Account Team in the event that a problem resulted from a design issue or system limitation
        Can collaborate to come up with a unified message on how best to move forward
        Proactively alerting them to these issues avoids mixed messaging – what if we sold this product/design to fit their needs and this problem was a result of it?
        Account Team may be able to leverage relationships with the customer during investigations where we can’t reproduce the problem in-house, but we NEED customer’s active participation to solve this problem
        May even be able to use account team resources to assist in on-site data collection
        The relationship they share with the customer can be leveraged to help set expectations that a difficult problem can be elusive and may require a lengthy investigation
        Proactively alerting the Account Team of hot issues can allow all Cisco parties to agree on a message in the event that the customer escalates through other channels outside of TAC
        When you need back-up to acquire an engineering resource to aid in an investigation or resolve a defect, you can leverage the account team to help quantify the business impact to the customer and future business risk to Cisco to compel engineering resources to engage.

How to Address Non-Problems
        Indicators of non-problems, and ways to address them when lacking knowledge and experience.
        “How do I configure this ?”
        Reference documentation – CCO/EDCS, play in the lab,  Ask an Expert
        There is someone at Cisco who can answer your question in 5 minutes…the challenge is finding that person and building a relationship
        “Help with a new install…”
        TAC DOES help with new installs – there is a Config Assistance problem code for a reason
        Grey area, but if customer can define exact goal/design and is looking for config assistance, we should be able to help
        When needed, if customer can’t define goal/design, collaborate with account team, AS, etc. to help provide a solution
        “Maintenance window support…”
        What concerns do we have about the maintenance window being successful?
        Can we go through the MOP in a lab environment proactively to avoid potential problems?
        “What is the best way to…”, “What product should I use to…”, “What version of software should I use?”
        Provide a disclaimer that your opinion stems from knowledge and experience gained in a support environment
        For scaled “how best?” questions that TAC may not be best suited to address, reach out to the customer account team, AS, and/or engineering marketing resources to help provide a solution.

5-Minute Kepner-Tregoe “Spec” for effective Problem Analysis 
Documenting the Problem Investigation
        Why is it important to document the investigation?
        Asking good contextual questions will result in a LOT of data – data that you will forget. 
        Knowledge re-use – we talked about how we need to be skeptical about data found in topic – how can we document problems better?
        What if another CSE, account team, escalation resources, or engineering looks at your SR and needs to get up to speed?
          What to document?
        A narrow problem statement identifying the failing object and the suspected problematic behavior it is exhibiting
        Be as specific as possible – include exact hostnames, hardware components, software versions when applicable
        Link to or document evidence collected showing the behavior
        Document time frames for problem occurrences
        Document comparative analysis between failing systems and working systems, noting any distinctions
        Note business impact
        Document action items – not only for Cisco but for the customers or third parties as needed
        Document working theories when developed, and how your action plan will rule them in or out
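The checklist above could be captured in a small structured record so that nothing is forgotten between calls. The sketch below is one possible layout, not an official case-note template: the class name (ProblemRecord), its field names, and every sample value are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ProblemRecord:
    """One investigation record; the fields mirror the checklist above."""
    statement: str                     # narrow problem statement
    failing_object: str                # exact hostname, component, version
    first_seen: str = "unknown"        # time frame of problem occurrences
    business_impact: str = "not yet assessed"
    evidence: list = field(default_factory=list)
    distinctions: list = field(default_factory=list)      # failing vs. working
    action_items: list = field(default_factory=list)      # (owner, task) pairs
    working_theories: list = field(default_factory=list)  # (theory, test) pairs

    def render(self):
        """Format the record as a plain-text case note."""
        lines = [
            f"PROBLEM: {self.statement}",
            f"FAILING OBJECT: {self.failing_object}",
            f"FIRST SEEN: {self.first_seen}",
            f"BUSINESS IMPACT: {self.business_impact}",
        ]
        sections = [
            ("EVIDENCE", self.evidence),
            ("DISTINCTIONS (failing vs. working)", self.distinctions),
            ("ACTION ITEMS", [f"{owner}: {task}" for owner, task in self.action_items]),
            ("WORKING THEORIES",
             [f"{theory} (test: {test})" for theory, test in self.working_theories]),
        ]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  - {item}" for item in items or ["none recorded"])
        return "\n".join(lines)

# Hypothetical example values, for illustration only.
record = ProblemRecord(
    statement="Routing neighbor on eth1 drops every few minutes",
    failing_object="host edge-rtr-01, software 1.2.3",
    first_seen="2020-03-16 10:04",
    evidence=["log excerpt attached to the SR"],
    action_items=[("Customer", "collect packet capture on eth1")],
    working_theories=[("Config change at 10:02 triggered the drops", "roll back in lab")],
)
note = record.render()
print(note)
```

Empty sections print as "none recorded", which makes gaps in the investigation visible to anyone who picks up the SR later.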


