
Hardware Debugging


fafig


Does anyone have an idea how to figure out what in my computer is causing constant hangs? I scanned all the memory with memtest today. I installed Fedora on another partition: it ran for 15 minutes, then hung. I booted CentOS, left it for a while, came back: hung. No information in any log, nothing, zero. Interestingly, the CD-ROM started behaving strangely; I wiggled the plug and it works. Maybe it's worn SATA connectors and, as a result, a bad contact (though I haven't unplugged them that often). To be safe, today I vacuumed the inside of the case, blew out the connectors and sprayed them with "Kontakt" contact cleaner. Maybe the disk is simply dying? I'm pretty much out of ideas. I disabled the NVIDIA drivers, so it looks like it could be a hardware issue. The same thing happens on every single kernel. I have one more hypothesis: on Fedora I got a kerneloops related to SMP, so maybe some cpufreq governor is to blame, but I'm not sure about that either and I no longer even know where to look.

 

Thanks in advance for any ideas.

 

smartctl logs

 

/dev/sda

 

smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3320620AS
Serial Number:    3QF08V9R
Firmware Version: 3.AAD
User Capacity:    320,072,933,376 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Aug 18 20:21:36 2009 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 ( 430) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 115) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000f   108   094   006    Pre-fail  Always       -       14958294
 3 Spin_Up_Time            0x0003   094   090   000    Pre-fail  Always       -       0
 4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1187
 5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       600604945
 9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       14744
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1226
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   054   045    Old_age   Always       -       38 (Lifetime Min/Max 37/39)
194 Temperature_Celsius     0x0022   038   046   000    Old_age   Always       -       38 (0 12 0 0)
195 Hardware_ECC_Recovered  0x001a   063   054   000    Old_age   Always       -       218204681
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       284
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 269 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 269 occurred at disk power-on lifetime: 2822 hours (117 days + 14 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 1e 54 42 00 e0  Error: ICRC, ABRT 30 sectors at LBA = 0x00004254 = 16980

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 25 00 3f 33 42 00 e0 00      00:18:09.053  READ DMA EXT
 25 00 3f 3f 00 00 e0 00      00:18:09.053  READ DMA EXT
 25 00 3f 2b 44 00 e0 00      00:18:09.050  READ DMA EXT
 25 00 3f 33 42 00 e0 00      00:18:09.049  READ DMA EXT
 25 00 3f 3f 00 00 e0 00      00:18:09.048  READ DMA EXT

Error 268 occurred at disk power-on lifetime: 2822 hours (117 days + 14 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 00 00 00 00 e0  Error: ABRT at LBA = 0x00000000 = 0

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c4 00 60 98 42 5e e2 00      00:17:12.231  READ MULTIPLE
 c4 00 08 90 42 5e e2 00      00:17:12.223  READ MULTIPLE
 c4 00 08 50 db f3 e2 00      00:17:12.069  READ MULTIPLE
 c4 00 08 48 db f3 e2 00      00:17:12.068  READ MULTIPLE
 c4 00 08 90 34 f1 e2 00      00:17:12.050  READ MULTIPLE

Error 267 occurred at disk power-on lifetime: 2821 hours (117 days + 13 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 00 00 00 00 e0  Error: ABRT at LBA = 0x00000000 = 0

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c4 00 60 19 42 c5 e1 00      00:08:46.163  READ MULTIPLE
 c5 00 08 69 ae 07 e0 00      00:08:46.160  WRITE MULTIPLE
 c5 00 08 18 d3 fe e2 00      00:08:46.160  WRITE MULTIPLE
 c5 00 08 38 b2 fc e2 00      00:08:46.190  WRITE MULTIPLE
 c5 00 08 b8 b1 fc e2 00      00:08:46.190  WRITE MULTIPLE

Error 266 occurred at disk power-on lifetime: 2821 hours (117 days + 13 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 00 00 00 00 e0  Error: ABRT at LBA = 0x00000000 = 0

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c4 00 80 00 de 0e e0 00      00:08:39.522  READ MULTIPLE
 c4 00 80 80 dd 0e e0 00      00:08:39.513  READ MULTIPLE
 c4 00 80 00 dd 0e e0 00      00:08:39.505  READ MULTIPLE
 c4 00 18 79 ea b6 e1 00      00:08:39.496  READ MULTIPLE
 c4 00 40 31 ea b6 e1 00      00:08:39.488  READ MULTIPLE

Error 265 occurred at disk power-on lifetime: 2821 hours (117 days + 13 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 00 00 00 00 e0  Error: ABRT at LBA = 0x00000000 = 0

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 c4 00 80 00 bb 0e e0 00      00:08:32.319  READ MULTIPLE
 ec 00 00 00 00 00 a0 02      00:08:32.316  IDENTIFY DEVICE
 ef 03 08 00 00 00 a0 00      00:08:32.316  SET FEATURES [set transfer mode]
 ec 00 00 00 00 00 a0 02      00:08:32.313  IDENTIFY DEVICE
 00 00 80 00 00 00 00 06      00:08:32.186  NOP [Abort queued commands]

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
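
A side note on reading the error log above: the register dump for Error 269 can be decoded by hand. The sketch below is only an illustration, assuming the usual ATA error-register bit layout (ICRC in bit 7, ABRT in bit 2) and the 28-bit LBA packing that matches the log line; the function names are made up for this example. It also subtracts the power-on hours of the last logged error from attribute 9, which suggests the logged errors may be fairly old, provided the lifetime counter has not wrapped.

```python
# Subset of ATA error register (ER) bits; layout assumed from the ATA spec.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error during an Ultra DMA transfer
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # requested sector ID not found
    0x04: "ABRT",  # command aborted
}


def decode_error_register(er):
    """Return the names of the known bits set in the ER value."""
    return [name for bit, name in ERROR_BITS.items() if er & bit]


def lba28(sn, cl, ch, dh):
    """Combine SN, CL, CH and the low nibble of DH into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Error 269 from the sda log: ER ST SC SN CL CH DH = 84 51 1e 54 42 00 e0
print(decode_error_register(0x84))     # flags reported as "ICRC, ABRT"
print(lba28(0x54, 0x42, 0x00, 0xE0))   # LBA reported as 16980
print(0x1E)                            # SC: sector count, reported as 30

# Power_On_Hours now (attribute 9) minus the lifetime hours of Error 269.
print(14744 - 2822)                    # power-on hours since the last logged error
```

ICRC points at the link between controller and drive rather than at the platters, which would fit the loose SATA plug theory, but that is a guess from the log and not a diagnosis.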

 

 

/dev/sdb

 


smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD103UJ
Serial Number:    S13PJ90QB40988
Firmware Version: 1AA01113
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Tue Aug 18 20:21:43 2009 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
				was never started.
				Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (11388) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 191) minutes.
Conveyance self-test routine
recommended polling time: 	 (  20) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
 3 Spin_Up_Time            0x0007   077   077   011    Pre-fail  Always       -       7720
 4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       193
 5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
 8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       0
 9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4995
10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       186
13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   068   000    Old_age   Always       -       29 (Lifetime Min/Max 29/30)
194 Temperature_Celsius     0x0022   070   068   000    Old_age   Always       -       30 (Lifetime Min/Max 29/30)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       6669
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   099   099   000    Old_age   Always       -       5
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
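
To compare the two drives without reading the tables by eye, something like the following could pull out the attributes that usually matter for "dying disk vs. bad cable". It is a rough sketch: the list of watched attributes is my own choice, the parser assumes the ten-column layout shown above, and the sample rows are copied from the two tables in this post.

```python
# Attributes picked for this example; not an official or complete list.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count")

SDA_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       284
"""

SDB_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   099   099   000    Old_age   Always       -       5
"""


def parse_attributes(table_text):
    """Map attribute name to the first integer in its RAW_VALUE column."""
    attrs = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


print("sda:", suspicious(parse_attributes(SDA_ROWS)))
print("sdb:", suspicious(parse_attributes(SDB_ROWS)))
```

Both drives show zero reallocated and pending sectors; only the CRC counter is nonzero, and it is much higher on sda. CRC errors are normally counted on the interface, so this looks more like cabling or connectors than a failing surface, though it does not rule out other causes of the hangs.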



Hangs are the hardest bugs to track down... in your place I'd look into sda. Admittedly I don't know much about this, but if you have errors there, that could be the cause of the problem.

 

If you wanted to debug the kernel, it shouldn't be hard to find out online how it's done (kernel hacking/debugging). You'd have to recompile the kernel with a few nice options enabled in the kernel hacking section and get to work... maybe you can get by on symbols alone plus magic SysRq: hit Alt+SysRq+p and get the name of the function it's stuck in.
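
Before relying on the SysRq key combination it is worth checking that the kernel will honour it. A minimal sketch, assuming the documented meaning of /proc/sys/kernel/sysrq (0 = disabled, 1 = everything enabled, larger values = bitmask) and assuming the register dump on "p" falls under the "debugging dumps" bit 0x8; the function names are invented for this example.

```python
SYSRQ_ENABLE_DUMP = 0x0008  # assumed bit for "debugging dumps of processes etc."


def dump_keys_enabled(sysrq_value):
    """Return True if the register-dump key would be accepted from the keyboard."""
    if sysrq_value == 1:  # 1 means all SysRq functions are enabled
        return True
    return bool(sysrq_value & SYSRQ_ENABLE_DUMP)


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current SysRq setting (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    try:
        value = read_sysrq()
        print("sysrq =", value, "dump keys enabled:", dump_keys_enabled(value))
    except OSError:
        print("could not read the sysrq setting on this system")
```

If this reports False, writing 1 to that file as root before the next hang should enable the key. A machine that is frozen hard with interrupts off may still not react to it.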

 

Sorry for asking, but I have to... you did test without X, right?

 

The good news is that you might find (and fix) a bug in the Linux kernel. I'd be happy about that. Good luck :)


One problem I do have: on Fedora an SMP error kept popping up, so the APIC controller may be at fault. For example, with the kvm module loaded, CentOS got a lockup on core 0 and then on core 1. It's possible the APIC is simply misbehaving, even though I've never had to disable it. I always thought you'd only disable it once you get a kernel panic, but these are probably symptoms of APIC problems too. On kerneltap I read that they broke something in its handling as of kernel 2.6.9 and it has dragged on ever since. On the other hand, motherboard makers (consumer electronics in particular) implement APIC in ways that don't follow the standard, hence the problems. I'll build my next computer on a server board, some Tyan or the like, because really, what can you expect from a 350 zł board with a cut-down BIOS. Heh, dmesg tells me to enable NUMA, but I wonder how, since there is no such option. You really shouldn't skimp on the motherboard... Anyway, every component in the computer has been cleaned; maybe something wasn't making contact. I've started a full SMART test on the HDD, and tomorrow I'll try testing without APIC. Some people on mailing lists report problems like this too. For example, even on CentOS 4 someone wrote that he gets hangs like these at random times and under random load, and suggested it might be the hardware. I probably won't be debugging the kernel for a simple reason: I'm not very good at digging around in that kind of thing, especially since I'd need to read some book about the kernel first....
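
For the planned test without APIC, one way to confirm that the boot options actually took effect is to look at the kernel command line after booting. A small sketch, assuming the test is done with the `noapic` and/or `nolapic` boot parameters; the sample command line in the code is hypothetical.

```python
APIC_FLAGS = ("noapic", "nolapic")  # boot parameters assumed for the test


def apic_flags(cmdline):
    """Return the APIC-disabling flags present in a kernel command line."""
    tokens = cmdline.split()
    return [flag for flag in APIC_FLAGS if flag in tokens]


if __name__ == "__main__":
    # Hypothetical example line; on a real system read /proc/cmdline instead.
    sample = "ro root=/dev/sda1 noapic quiet"
    print(apic_flags(sample))
    try:
        with open("/proc/cmdline") as f:
            print("running kernel:", apic_flags(f.read()))
    except OSError:
        print("/proc/cmdline not available here")
```

An empty list for the running kernel would mean the hang test was still done with APIC enabled.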

 

By the way, I just remembered that I once ran Arch on an old computer (P3 933) and it hung too, after about 15 minutes at the console. I don't know what to make of it anymore... Anyway, thanks for the responses.

