[NLPL Task Force (A)] Fwd: DellEMC Technical Support - [SRNumber:955846133]

Stephan Oepen oe at ifi.uio.no
Mon Nov 6 10:33:31 UTC 2017


dear all,

just fyi: ‘ls.hpc’ (and, thus, the NLPL wiki) is back on-line now,
though there are remaining issues with the RAID controller, and we
will have to take the machine off-line once again sometime later this
week.

best, oe


---------- Forwarded message ----------
From: Kjell Andresen <kjell.andresen at usit.uio.no>
Date: Mon, Nov 6, 2017 at 11:07 AM
Subject: RE: DellEMC Technical Support - [SRNumber:955846133]
To: A.OConnor at dell.com
Cc: oe at ifi.uio.no, hpc-core at usit.uio.no, Sam Phung
<sam.phung at usit.uio.no>, adrian.helle at usit.uio.no, Kjetil Kirkebø
<kjetil.kirkebo at usit.uio.no>


[root at ls ~]# dmidecode -s system-serial-number
620SJ32

Hi all!

Thank you all for your time and hands helping out with ls.hpc.

Status now:
ls.hpc is up and running on the 4 system disks in RAID5
(the 4 x 1 TB HDDs). The SSDs still have an issue; see below.

The HDD volume group is the only one visible in the OS [1].

In the console/iDRAC, both RAID5 virtual disks are showing [2].

There is one SSD, 0:1:7, blinking both green and amber on the same
LED. Its status looks a little strange to me; see
http://folk.uio.no/kjell/usit/ls/IMG_3899.JPG
(it is another LiteOn IT ECE 40 disk).

Aaron: I think it would be wise to replace SSD 0:1:7 before
removing the ssdhs-vg again and remaking it as RAID5 with one hot spare.
I have collected a new TSR with RAID Controller Log at
http://folk.uio.no/kjell/usit/ls/TSR20171106105543_620SJ32.zip

oe: Last time I removed the SSD virtual disk we lost the HDD virtual
disk, so I think it is wise to wait for tonight's backup of the
mounted filesystems and to replace SSD 0:1:7 before removing the
ssd-vg again.

We also had some problems entering the console this morning after
entering the root password on the machine [3].
This was resolved after powering off the machine and leaving it
without power for a few minutes.

I will leave ls.hpc.uio.no up and running as is until further notice
(regarding the SSDs); the locally mounted disks are listed in [4].

/Kjell

[1]
Status of ls.hpc now:
----------------------------------
[root at jump-ojd kjell-drift]# ssh ls.hpc
Last login: Mon Nov  6 09:16:55 2017 from jump-ojd.uio.no
[root at ls ~]# uptime
 09:54:09 up 3 min,  1 user,  load average: 0.93, 0.64, 0.26
[root at ls ~]# vgs
  VG       #PV #LV #SN Attr   VSize VFree
  internvg   1   8   0 wz--n- 2.73t 1.05t

[root at ls ~]# uname -r; grubby --default-kernel
2.6.32-696.13.2.el6.x86_64
/boot/vmlinuz-2.6.32-696.13.2.el6.x86_64

[2]
The status of the vdisks in ls.hpc now:
http://folk.uio.no/kjell/usit/ls/ls-lc-vdisks.png

[3]
Error after rebooting ls.hpc from the console this morning:
http://folk.uio.no/kjell/usit/ls/IMG_3897.JPG

[4]
Mounted disks at ls.hpc now:
-----------------------------------------
[root at ls ~]# df -lh
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/internvg-root
                      2.0G  530M  1.3G  29% /
tmpfs                 190G  4.0K  190G   1% /dev/shm
/dev/sdb1             488M  107M  356M  24% /boot
/dev/mapper/internvg-opt
                       31G  3.3G   26G  12% /opt
/dev/mapper/internvg-tmp
                       51G  224M   48G   1% /tmp
/dev/mapper/internvg-usr
                      152G  3.7G  141G   3% /usr
/dev/mapper/internvg-var
                      150G   14G  129G  10% /var
/dev/mapper/internvg-system
                       99G  1.9G   92G   2% /usit/ls/system
/dev/mapper/internvg-scratch
                     1008G  216G  742G  23% /usit/ls/scratch
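
As an aside, while the machine stays up it is easy to keep an eye on
the mounted filesystems with a small scan over df-style output. This
is only a sketch: the sample lines below mirror a few entries from the
listing above, and the 90% threshold is illustrative; on the live host
one would parse the real `df -lP` output instead.

```shell
#!/bin/sh
# Warn about any filesystem at or above a usage threshold.
# Input lines: "<use%> <mount point>", as extractable from 'df -lP'.
threshold=90
printf '%s\n' \
  '29% /' \
  '24% /boot' \
  '23% /usit/ls/scratch' |
while read -r pcent target; do
  use=${pcent%\%}                     # strip the trailing '%'
  if [ "$use" -ge "$threshold" ]; then
    echo "WARNING: $target at $pcent"
  fi
done
```

With the sample values above nothing is printed, since all three
filesystems are well below the threshold.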



