[Matrix server] Upcoming maintenance and backup restore July 31

Milan@discuss.tchncs.de · edit-2 1 year ago

[Matrix server] Upcoming maintenance and backup restore July 31

erAck@discuss.tchncs.de · 1 year ago

Milan, in compensation for all this hassle you should take the next three Mondays off.

Milan@discuss.tchncs.de · 1 year ago

i like that idea

Courbet_eiro@discuss.tchncs.de · 1 year ago

Thank you so much for all the work. I have tried on many occasions to create an account in matrix these last days and it has been impossible for me. I never get the email verification.

Again, thanks for all the work you do.

Milan@discuss.tchncs.de · 1 year ago

the email settings were re-added a bit late, sorry for the confusion.

Haui@discuss.tchncs.de · 1 year ago

Thanks for putting in the work. Is there anything we can help you with? From what I understood the domain is German; is the server in Germany as well? I’m located in Germany and do sysadmin work. Fighting with hosting companies is part of my job. ;) Let me know if I can do anything. Have a good one!

Milan@discuss.tchncs.de · 1 year ago

Thank you :) Well, i am not sure if there was something to fight over except maybe some sort of refund… for now it seems to be fine on the new machine. – yes, i am from Germany, however i think it’s a Helsinki DC from Hetzner.

Haui@discuss.tchncs.de · 1 year ago

You’re very welcome. Hetzner is generally a good host afaik. It does depend on the configuration, I suppose. Are you using the shared VPS or something else? If the storage is guaranteed (as in not custom hardware) they are technically responsible for its condition. A host I’m working with (also located at Hetzner, but in Falkenstein) does 2 backups a day, which also prevents having to revert far back.

Milan@discuss.tchncs.de · 1 year ago

on Hetzner it’s all dedicated servers – out goes an ax51-nvme, in comes an ax102. they tried a connector cable swap to bring the nvme(s) back to life; i was wondering if this could have something to do with the smart errors logged and the temp zpool errors. however i think the cpu upgrade is at least very welcome for the matrix server 😅

Haui@discuss.tchncs.de · 1 year ago

Hm. In that case I’m not sure what their obligations are. It’s very rare that I hear of NVMes downright failing.

If your SMART error rates start going up, that is a clear indicator that something is gonna happen. I have a graph on my server showing the error rates. Actually, there is a “bad sectors” or “reallocated sectors” reading that should be more telling. Once they go up it’s critical, I think.

I didn’t even know you also ran a Matrix server. I recently started looking into Matrix but I can’t really say anything yet. Is it federated as well? Or do you need to make a new account for each one?

Milan@discuss.tchncs.de · edit-2 1 year ago

Yes, it is federated – however since there is no SSO on the Lemmy instance, you need to make a new account. Like you need to make new accounts between email providers. :) However it is a different federation protocol: Matrix vs ActivityPub. For more cool stuff, check out https://tchncs.de :3

Haui@discuss.tchncs.de · 1 year ago

Cool! Thanks! I will check it out.

Milan@discuss.tchncs.de · edit-2 1 year ago

Dang, the old host was deleted from the monitoring – however, looking at at least one SMART report from my emails, there were no errors logged before the drives gave up on life during replacement. They just had a ton read/written and the used counter at 255% (even tho rw and age were not equal, it’s weird and one reason why i wanted to have at least one replaced in the first place). This is the one that had more:

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        53 Celsius
Available Spare:                    98%
Available Spare Threshold:          10%
Percentage Used:                    255%
Data Units Read:                    7,636,639,249 [3.90 PB]
Data Units Written:                 2,980,551,083 [1.52 PB]
Host Read Commands:                 87,676,174,127
Host Write Commands:                28,741,297,023
Controller Busy Time:               705,842
Power Cycles:                       7
Power On Hours:                     17,437
Unsafe Shutdowns:                   1
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               53 Celsius
Temperature Sensor 2:               64 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
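
For anyone reading along: the `Critical Warning: 0x04` in the report above is a bit field, so it can be decoded bit by bit. Below is a minimal Python sketch of that decoding. The bit meanings are paraphrased from the NVMe SMART/Health log definition as I understand it; the names `CRITICAL_WARNING_BITS` and `decode_critical_warning` are made up for illustration and are not part of smartctl.

```python
# Bit meanings of the NVMe SMART "Critical Warning" byte (log page 0x02).
# Wording is paraphrased; the dict and function names are illustrative only.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
    5: "persistent memory region read-only or unreliable",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return the description of every warning bit set in `value`."""
    return [
        name
        for bit, name in CRITICAL_WARNING_BITS.items()
        if value & (1 << bit)
    ]


# Values taken from the reports in this thread: 0x04 (old drive), 0x00 (new drives).
print(decode_critical_warning(0x04))
print(decode_critical_warning(0x00))
```

Under that reading, 0x04 sets only bit 2 (reliability degraded), which would fit a drive that reports no logged errors but is past its rated endurance. As far as I know the `Percentage Used` field is capped at 255, so 255% likely means "255% or more" rather than an exact figure.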

The new ones now, where the zpool errors happened, look like this:

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        24 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    3%
Data Units Read:                    122,135,021 [62.5 TB]
Data Units Written:                 31,620,076 [16.1 TB]
Host Read Commands:                 1,014,224,069
Host Write Commands:                231,627,064
Controller Busy Time:               3,909
Power Cycles:                       2
Power On Hours:                     117
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      4
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               24 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          4     0  0x0000  0x8004  0x000            0     0     -

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        24 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    2%
Data Units Read:                    153,193,333 [78.4 TB]
Data Units Written:                 29,787,075 [15.2 TB]
Host Read Commands:                 1,262,977,843
Host Write Commands:                230,135,280
Controller Busy Time:               4,804
Power Cycles:                       11
Power On Hours:                     119
Unsafe Shutdowns:                   5
Media and Data Integrity Errors:    0
Error Information Log Entries:      14
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               24 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         14     0  0x100d  0x8004  0x000            0     0     -

Haui@discuss.tchncs.de · 1 year ago

You’re not telling me you’re reading 62 TB in 117 hours, right? Right? xD The old ones were even in the petabytes.

Those numbers are just insane. I have worked with AI training and storage. I have never seen such numbers.

Well, I suppose that NVMe was very much EOL. Now I understand the behavior. This many operations in such a short time will put serious strain on your system. No wonder parts can give up. Are you using a RAID config? Sorry if you already mentioned it.
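
To put the "62 TB in 117 hours" figure in perspective, here is a rough Python sketch of the arithmetic. It assumes that one SMART "data unit" is 1000 blocks of 512 bytes (which is how the NVMe log defines it, and it reproduces the `[62.5 TB]` that smartctl prints), and that the drive was busy for all of its power-on hours, which may well not be true. The function names are made up for illustration.

```python
# One NVMe SMART "data unit" is 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_tb(units: int) -> float:
    """Convert SMART data units to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def avg_mb_per_s(units: int, power_on_hours: int) -> float:
    """Average throughput in MB/s, assuming activity over all power-on hours."""
    seconds = power_on_hours * 3600
    return units * BYTES_PER_DATA_UNIT / 1e6 / seconds


# Numbers from the first of the new drives quoted above.
read_units, hours = 122_135_021, 117
print(f"{data_units_to_tb(read_units):.1f} TB read")
print(f"{avg_mb_per_s(read_units, hours):.0f} MB/s average over {hours} h")
```

Averaged over the full 117 power-on hours this comes out somewhere below 150 MB/s of sustained reads, which is high for an idle box but well within what an NVMe drive can deliver. The counters may also include whatever happened to the drive before it arrived.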

Milan@discuss.tchncs.de · edit-2 1 year ago

i am not sure about those numbers on the new ones … it was one db restore and a few hrs of uptime … a scrub … then i rsynced some stuff over, and since then the thing has been idle 🤷

sample of the current active system … i think at time of arrival it was 2+ TB written or something

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    88,116,921 [45.1 TB]
Data Units Written:                 43,968,235 [22.5 TB]
Host Read Commands:                 689,015,212
Host Write Commands:                409,762,513
Controller Busy Time:               1,477
Power Cycles:                       4
Power On Hours:                     248
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               37 Celsius
Temperature Sensor 2:               46 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

GVasco@discuss.tchncs.de · 1 year ago

Awesome work! Thank you for all the time and effort you put into this. Let us know if you start to feel the need for some help in managing other aspects of running those instances. Best of luck going forward and looking forward to the future with this instance!

jasondaigo@discuss.tchncs.de · 1 year ago

I feel for you. Is #tchncs:tchncs.de still the correct toon when it’s running fine again ?

Milan@discuss.tchncs.de · 1 year ago

what do you mean?
