
Troubleshooting a Cisco 6500 crash
March 10, 2010
I was asked recently to share some knowledge about supporting the Cisco 6500 switches, as the information available on the DOC-CD can be fairly overwhelming.
As it happens, a client's Cisco-6509 switch fell over yesterday. I was called out to deal with the Cisco-6509, which had decided it was tired of life and rebooted itself. I’ll go through some of the steps I took to find the root cause. Note that the steps listed here will not find the cause of every possible issue with a 6500 switch, but they can be used as a guideline.
Usually the first thing I do is check the reason for the reboot with “sh version”. Look at the “System returned to ROM” line in particular.
ndcbbnpendc0103#sh ver
Cisco Internetwork Operating System Software
IOS (tm) s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(18)SXF6, RELEASE SOFTWARE (fc1)
Technical Support: http://www.cisco.com/techsupport
Copyright (c) 1986-2006 by cisco Systems, Inc.
Compiled Mon 18-Sep-06 23:32 by tinhuang
Image text-base: 0x40101040, data-base: 0x42D90000
ROM: System Bootstrap, Version 12.2(17r)SX5, RELEASE SOFTWARE (fc1)
BOOTLDR: s72033_rp Software (s72033_rp-ADVENTERPRISEK9_WAN-M), Version 12.2(18)SXF6, RELEASE SOFTWARE (fc1)
ndcbbnpendc0103 uptime is 3 hours, 23 minutes
Time since ndcbbnpendc0103 switched to active is 3 hours, 22 minutes
System returned to ROM by s/w reset at 00:14:27 PDT Wed Sep 20 2006 (SP by bus error at PC 0x402DC89C, address 0x0)
System restarted at 09:13:44 ZA Wed Mar 10 2010
System image file is "disk0:s72033-adventerprisek9_wan-mz.122-18.SXF6.bin"
The switch clearly did a software reset caused by ‘bus error at PC 0x402DC89C, address 0x0’.
You can see it was caused by a system bus error. A system encounters a bus error when the processor tries to access a memory location that either does not exist (a software problem) or does not respond properly (a hardware problem). The memory location that this router tried to access was ‘0x0’. Do not confuse this with the program counter (PC) value above. Given the address the router was accessing when the bus error occurred, the “show region” command can be used to determine which memory region that address corresponds to.
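If you check reload reasons across many devices, pulling the fields out of that line by script can save some squinting. The following is only a minimal sketch in Python: the function name and the regular expressions are my own assumptions based on the single “System returned to ROM” line shown above, and other IOS versions may word it differently.

```python
import re

# Assumed patterns, derived only from the sample line in this post.
RELOAD_RE = re.compile(r"System returned to ROM by (?P<reason>.+?) at \d{2}:\d{2}:\d{2}")
BUS_ERROR_RE = re.compile(r"bus error at PC (0x[0-9A-Fa-f]+), address (0x[0-9A-Fa-f]+)")


def parse_reload_line(line):
    """Return the reload reason plus the PC and faulting address, if present."""
    reload_match = RELOAD_RE.search(line)
    bus_match = BUS_ERROR_RE.search(line)
    return {
        "reason": reload_match.group("reason") if reload_match else None,
        # The PC is where the code was executing; the address is what it tried to access.
        "pc": int(bus_match.group(1), 16) if bus_match else None,
        "address": int(bus_match.group(2), 16) if bus_match else None,
    }


# Sample line copied from the "sh ver" output in this post.
SAMPLE_LINE = (
    "System returned to ROM by s/w reset at 00:14:27 PDT Wed Sep 20 2006 "
    "(SP by bus error at PC 0x402DC89C, address 0x0)"
)

if __name__ == "__main__":
    info = parse_reload_line(SAMPLE_LINE)
    print(info["reason"], hex(info["pc"]), hex(info["address"]))
```

Keeping the PC and the accessed address as separate fields mirrors the point that the two values should not be confused.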
If the address falls within one of the ranges in the “show region” output, it means that the router was accessing a valid memory address, but the hardware corresponding to that address is not responding properly. This would indicate a hardware problem.
If the address reported by the bus error does not fall within the ranges displayed in the “show region” output, it means that the router was trying to access an address that is not valid. This indicates a Cisco IOS Software problem. From the output below it is clear that ‘0x0’ does not fall within any memory region.
ndcbbnpendc0103#sh region
Region Manager:
Start End Size(b) Class Media Name
0x08000000 0x0BFFFFFF 67108864 Iomem R/W iomem
0x08B69D40 0x08C77813 1104596 Criti R/W iomem:Critical I/O
0x40000000 0x4BFFFFFF 201326592 Local R/W main
0x40101040 0x42D8FFFF 46723008 IText R/O main:text
0x42D90000 0x430A83BF 3244992 IData R/W main:data
0x430A83C0 0x44AAE4DF 27287840 IBss R/W main:bss
0x44AAE4E0 0x4BFFFFFF 123018016 Local R/W main:heap
0x50000000 0x7FFF7FFF 805273600 Local R/W more heap
0x52A11DC0 0x538A42EB 15279404 Criti R/W more heap:Critical Processor
0x80000000 0x8BFFFFFF 201326592 Local R/W main:(main_k0)
0xA0000000 0xABFFFFFF 201326592 Local R/W main:(main_k1)
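The range check described above is easy to do by eye for ‘0x0’, but less so for an address in the middle of the map. Here is a small Python sketch of the same rule of thumb. The region table is copied by hand from the “sh region” output above; the function names and the "hardware"/"software" labels are my own illustration and not anything IOS provides.

```python
# (start, end, name) tuples copied from the "sh region" output in this post.
REGIONS = [
    (0x08000000, 0x0BFFFFFF, "iomem"),
    (0x08B69D40, 0x08C77813, "iomem:Critical I/O"),
    (0x40000000, 0x4BFFFFFF, "main"),
    (0x40101040, 0x42D8FFFF, "main:text"),
    (0x42D90000, 0x430A83BF, "main:data"),
    (0x430A83C0, 0x44AAE4DF, "main:bss"),
    (0x44AAE4E0, 0x4BFFFFFF, "main:heap"),
    (0x50000000, 0x7FFF7FFF, "more heap"),
    (0x52A11DC0, 0x538A42EB, "more heap:Critical Processor"),
    (0x80000000, 0x8BFFFFFF, "main:(main_k0)"),
    (0xA0000000, 0xABFFFFFF, "main:(main_k1)"),
]


def find_regions(address, regions=REGIONS):
    """Return the names of every region whose range contains the address."""
    return [name for start, end, name in regions if start <= address <= end]


def likely_cause(address, regions=REGIONS):
    """Apply the rule of thumb: valid address -> hardware, invalid -> software."""
    return "hardware" if find_regions(address, regions) else "software"


if __name__ == "__main__":
    for addr in (0x0, 0x402DC89C):
        print(hex(addr), find_regions(addr), likely_cause(addr))
```

Regions nest (for example, main:text sits inside main), so the lookup returns every match rather than stopping at the first one.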
The output of the “show stacks” command can then be used to identify the Cisco IOS Software bug that caused the bus error. It might be a bit overwhelming with all the garbage it spits out, but you will get used to the output soon enough. Alternatively, you can use the Cisco Output Interpreter to decode the output. I have posted the relevant portion here:
ndcbbnpendc0103#sh stack
--omitted--
Mar 9 17:01:58: %DIAG-SP-6-BYPASS: Module 4: Diagnostics is bypassed
Mar 9 17:01:58: %OIR-SP-6-INSCARD: Card inserted in slot 4, interfaces are now online
Mar 10 09:10:56: %C6K_PLATFORM-SP-2-PEER_RESET: SP is being reset by the RP
%Software-forced reload
Breakpoint exception, CPU signal 23, PC = 0x402DC89C
-Traceback= 402DC89C 402DA828 40435C38 40436DF8 404243B8 40424510 402CF4DC
--omitted--
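One quick sanity check on this output is that the first traceback frame matches the PC reported in “sh version”. The Python sketch below pulls both values out of the text. The helper names and regular expressions are assumptions based only on the excerpt above, and this does not decode the traceback into function names the way the Output Interpreter does.

```python
import re

# Sample text copied from the "sh stack" excerpt in this post.
SAMPLE_STACK = """\
%Software-forced reload
Breakpoint exception, CPU signal 23, PC = 0x402DC89C
-Traceback= 402DC89C 402DA828 40435C38 40436DF8 404243B8 40424510 402CF4DC
"""


def parse_pc(text):
    """Return the PC value from a 'PC = 0x...' line, or None if absent."""
    match = re.search(r"PC = (0x[0-9A-Fa-f]+)", text)
    return int(match.group(1), 16) if match else None


def parse_traceback(text):
    """Return the traceback frames as integers, in the order printed."""
    match = re.search(r"-Traceback=[ \t]*((?:[0-9A-Fa-f]{8}[ \t]*)+)", text)
    if match is None:
        return []
    return [int(frame, 16) for frame in match.group(1).split()]


if __name__ == "__main__":
    pc = parse_pc(SAMPLE_STACK)
    frames = parse_traceback(SAMPLE_STACK)
    print("PC:", hex(pc))
    print("Frames:", [hex(f) for f in frames])
    print("First frame matches PC:", bool(frames) and frames[0] == pc)
```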
If you Google the PLATFORM-SP-2 error, you should find the following:
Condition: Relates to WS-SUP720-3B running Cisco IOS Release 12.2(18)SXF2. The trigger for the crash is unknown.
Workaround: There is no workaround.
What have we established so far?
- A system bus error occurred when the processor tried to access an address that doesn’t exist. This points to a bug in the IOS version the switch is currently running.
- According to the bug description, the trigger that caused the crash is unknown, and there is no published workaround.
Where to from here? It is reasonable to assume that an IOS upgrade should rectify the problem. But in production, life is not that simple or quick. Upgrading the IOS of an in-production device usually means going through a painful change control process, which can take some time.
How do you prevent this from happening again until the IOS upgrade? Well you need to know what triggered the IOS bug causing the 6500 to go belly up. For this the crashinfo is vital. The crashinfo should tell us exactly what happened right before the software reload. Again the output from this can be overwhelming.
Using the command “more {location}:{crashfile}”, you will see a list of commands and logging events that happened. What you are looking for is usually the very last event before the traceback. Look at the last CMD entries before the exception:
ndcbbnpendc0103#more bootflash:crashinfo_20100310-071049
--omitted--
CMD: 'sh crypto isakmp sa vrf vpnafg' 09:10:11 ZA Wed Mar 10 2010
CMD: ' sh run int vlan1188' 09:10:34 ZA Wed Mar 10 2010
CMD: 'conf t' 09:10:37 ZA Wed Mar 10 2010
CMD: 'interface Vlan1188' 09:10:42 ZA Wed Mar 10 2010
CMD: 'no crypto map vpnafg-dtt-map redundancy VPNHA' 09:10:49 ZA Wed Mar 10 2010
Address Error (load or instruction fetch) exception, CPU signal 10, PC = 0x42172D1C
-Traceback= 42172D1C 42173324 4217341C 4217348C 42173DE4 4217B710 4217B470 4114FAD8 4113CEC4 4045C740 41047A68 4046AB48 4102F70C 4102F6F8
$0 : 00000000, AT : 430A0000, v0 : 53A8EF94, v1 : 00000000
a0 : 5283DE80, a1 : 291C4A3D, a2 : 0D0D0D0D, a3 : 410156BC
t0 : 00000010, t1 : 00000010, t2 : 00000000, t3 : FFFF00FF
t4 : 41D4CA58, t5 : 458AEDC8, t6 : 458AEDC4, t7 : 458AEDC0
s0 : 5310C4B4, s1 : 483DCC00, s2 : 483DCBF0, s3 : 00000010
s4 : 458C4CA4, s5 : 00000000, s6 : 00000000, s7 : 00000000
t8 : 44AC1848, t9 : 00000000, k0 : 475ACDB0, k1 : 41D52C68
gp : 430AE700, sp : 54327600, s8 : 43800000, ra : 42173324
EPC : 42172D1C, ErrorEPC : BFC2A65C, SREG : 3400F103
MDLO : 00000002, MDHI : 1D59CAA0, BadVaddr : 0D0D0D19
Cause 80000C10 (Code 0x4): Address Error (load or instruction fetch) exception
--omitted--
Here you can clearly see that when the crypto map was removed from interface VLAN-1188, the 6500's IOS choked. To prevent this, either lock that command via TACACS or instruct everyone not to use that command again until the IOS gets upgraded.
I hope this provides good insight into how to deal with a 6500 crash.
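Crashinfo files can run to many pages, so it can help to pull out just the last few commands logged before the exception. Below is a rough Python sketch of that idea. The function name, the regular expression, and the assumption that the first line containing "exception" after the command history marks the crash are all mine, based only on the excerpt above.

```python
import re

# Assumed format of a command-history line, based on the crashinfo excerpt in this post.
CMD_RE = re.compile(r"^CMD: '(?P<cmd>.*)' (?P<time>\d{2}:\d{2}:\d{2} .+)$")

# Sample lines copied from the crashinfo excerpt in this post.
SAMPLE_CRASHINFO = """\
CMD: 'sh crypto isakmp sa vrf vpnafg' 09:10:11 ZA Wed Mar 10 2010
CMD: ' sh run int vlan1188' 09:10:34 ZA Wed Mar 10 2010
CMD: 'conf t' 09:10:37 ZA Wed Mar 10 2010
CMD: 'interface Vlan1188' 09:10:42 ZA Wed Mar 10 2010
CMD: 'no crypto map vpnafg-dtt-map redundancy VPNHA' 09:10:49 ZA Wed Mar 10 2010
Address Error (load or instruction fetch) exception, CPU signal 10, PC = 0x42172D1C
"""


def last_commands_before_exception(lines, count=3):
    """Return the last `count` (time, command) pairs logged before the exception."""
    history = []
    for line in lines:
        match = CMD_RE.match(line.strip())
        if match:
            history.append((match.group("time"), match.group("cmd").strip()))
        elif "exception" in line and history:
            # Assumption: the first exception line after the history marks the crash.
            break
    return history[-count:]


if __name__ == "__main__":
    for when, cmd in last_commands_before_exception(SAMPLE_CRASHINFO.splitlines()):
        print(when, "->", cmd)
```

The last entry returned is the command to look at first, though it still takes a human to judge whether it really was the trigger.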
Great Explanation as always Ruhann!
I see your point SXF6 is a couple of years old. I feel your pain trying to upgrade the old code to Safe Harbor.
thanks buddy
To be exact, it's from 2006, and riddled with IOS bugs.
Nice explanation. Can you please advise how you learnt this information. I don’t believe there is a whitepaper that explains it so simply. There are quite a few other troubleshooting commands which we are always told only Cisco Tac can debug. It would be neat to know and interpret these commands and if there is a resource out there that gives this information it would be very helpful. thx
Thanks
Unfortunately there is no magic site I used other than CCO. Would be nice if there was. What I know is what I encountered in the last 4 years. This is the biggest reason I blog, to share knowledge like so many others out there