I'm in the process of emptying a disk shelf on an AFF8080 in order to move it to a newer A700 system...
The AFF8080 is a two-node system with partitioned 3.8T SSD disks... pretty standard Root-Data-Data partitioning.
One RG on each node, sharing three DS224-12 shelves...
We have emptied one of the two aggregates and we are now in the process of copying the partitions around in order to empty one of the three shelves...
It all worked fine for two of the three RAID groups, but the last RG seems to stall on us...
We basically run a command like:
disk partition replace -action start -partition 4.1.10.P2 -replacement 1.10.4.P2
The copy starts, which we can see with "storage aggregate show-status -aggregate DATA02".
And it does indeed show us:
shared 4.1.2 0 SSD - 1.74TB 3.49TB (replacing, copy in progress)
shared 4.0.0 0 SSD - 1.74TB 3.49TB (copy 0% completed)
So far so good...
But... it never gets past 0%... in fact, in the event log we can see the following:
event log show
5/27/2020 17:22:31 NETAPP01-02 NOTICE raid.rg.diskcopy.aborted: /DATA02/plex0/rg2: disk copy from 0d.01.2P2 to 4a.00.0P2 aborted at disk block 5248 after 53:38.94. Reason: Disk copy temporarily suspended and will resume automatically..
And we have a lot of these notices, and none of them gets past block 5248... it's been like this for almost an hour now... (53 mins.)
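For anyone who wants to check the same thing on their own system, here is a rough Python sketch that scans saved "event log show" output for raid.rg.diskcopy.aborted lines and tells you whether the abort block ever moves. The regular expression is only modeled on the single message quoted above, and the function names (copy_progress, is_stalled) are made up for this example, so treat it as a starting point rather than anything official.

```python
import re

# Pattern modeled on the one raid.rg.diskcopy.aborted line quoted in this post.
# Assumption: other ONTAP releases may word the message differently.
ABORT_RE = re.compile(
    r"raid\.rg\.diskcopy\.aborted: (?P<rg>\S+): disk copy from (?P<src>\S+) "
    r"to (?P<dst>\S+) aborted at disk block (?P<block>\d+)"
)


def copy_progress(log_lines):
    """Collect abort block numbers per (raid group, source, destination)."""
    progress = {}
    for line in log_lines:
        m = ABORT_RE.search(line)
        if m:
            key = (m["rg"], m["src"], m["dst"])
            progress.setdefault(key, []).append(int(m["block"]))
    return progress


def is_stalled(blocks):
    """True if at least two aborts were seen and all report the same block."""
    return len(blocks) >= 2 and len(set(blocks)) == 1
```

Feed copy_progress() the lines of the saved event log and run is_stalled() on each list of blocks; a copy that is genuinely progressing between suspensions should show increasing block numbers, while ours would report the same value every time.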
There is a bit of load on the aggregate...
As you can see there is quite some CPU load on the system... but that's all because of this copy... even though it does not seem to do anything...
In this sysstat check I have stopped the disk replace...
I have managed to replace 15 out of 24... but the last 9 just won't start and hang as described above...
The raid.resync.perf_impact option is set to medium, and I'm not too keen on raising it to high...
There do not seem to be any other errors on the system...
I'm just trying the community before opening a case, maybe someone has the golden key to this? 😉
/Heino