[haiku-inc] Re: Baron is acting up - what to do?

  • From: Oliver Tappe <zooey@xxxxxxxxxxxxxxx>
  • To: haiku-inc@xxxxxxxxxxxxx
  • Date: Sun, 16 Dec 2012 22:51:27 +0100

On 2012-12-16 at 21:44:32 [+0100], Niels Sascha Reedijk 
<niels.reedijk@xxxxxxxxx> wrote:
> 
> On Sun, Dec 16, 2012 at 6:18 PM, Oliver Tappe <zooey@xxxxxxxxxxxxxxx> wrote:
> > Hi there,
> >
> > the two disks that are being used by our dedicated server are acting up,
> > i.e. they report a (slowly) increasing number of bad blocks.
> 
> I am a noob when it comes to hard drive health. Are the SMART messages
> we are seeing merely repeating messages of the same problem? Or are
> they actually showing real time dead sectors?

SMART is sending out repeated messages, but sometimes the messages contain 
hints at changed values, which (most of the time) indicate an additional 
problem.

> > /dev/sda currently shows 6 bad sectors, some of which seem to have been
> > caused by the AC-breakdown that happened in our data center this summer.
> > /dev/sdb used to be fine until yesterday, now it is reporting 1 bad block,
> > too. The disks should have a considerable amount of spare blocks left, so
> > this isn't really an emergency situation, but (judging from what I've read
> > in our hosting provider's forum) those disks are likely to die sooner or
> > later.
> 
> Is there any recommended action? I understand that disks deteriorate
> after a few years, but how long have ours been alive? Are there any
> diagnostics we can run? Is there a sign that our disks are going bad
> quicker than normal?

Our disks have been alive for three years and AFAICS, this kind of problem 
is nothing unusual for the type of consumer disks that are being used in the 
server (and would be used in the new server, too). It is very difficult to 
say how fast the disks will get worse, though - they could just keep running 
for three more years. But generally, when SMART starts to cough, it is 
recommended to think about how to replace the disk.

Concerning the possible diagnostics: I started a check of the RAID1 array 
yesterday. According to the docs, that procedure causes the multi-disk 
driver to read all sectors from both drives and compare them, replacing any 
non-readable sector with the (readable) counterpart from the other drive. As 
a result, the pending sectors should have disappeared (as overwriting a 
non-readable sector should cause the drive to reallocate a spare sector for 
it). Surprisingly, the number of mismatched sectors after that process had 
finished was still 0 and nothing had changed about the pending and/or 
reallocated sectors. As of now, I don't know why that's the case.
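
For anyone who wants to repeat this, here is a minimal Python sketch of the procedure described above. It is not the exact set of commands that was run on baron. It assumes the array is called md0 and that the SMART values come from the attribute table printed by `smartctl -A`; neither the device name nor the tool is confirmed in this thread, and all function names are made up for illustration.

```python
from pathlib import Path

# Assumption: the array is md0; the real device name is not given in this thread.
MD_SYSFS = Path("/sys/block/md0/md")


def start_check(md_sysfs: Path = MD_SYSFS) -> None:
    """Ask the md driver to read and compare all sectors (needs root)."""
    (md_sysfs / "sync_action").write_text("check\n")


def mismatch_count(md_sysfs: Path = MD_SYSFS) -> int:
    """Return the mismatch counter that md reports for the array."""
    return int((md_sysfs / "mismatch_cnt").read_text().strip())


def parse_smart_attributes(smartctl_output: str) -> dict:
    """Pick the two sector counters out of an attribute table.

    Expected row layout (as printed by `smartctl -A`):
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    wanted = {"Reallocated_Sector_Ct", "Current_Pending_Sector"}
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Only accept full rows whose raw value is a plain integer.
        if len(fields) >= 10 and fields[1] in wanted and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values
```

Comparing the output of `parse_smart_attributes()` before and after a finished check would show whether any pending sectors were actually rewritten and reallocated.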

> > Unrelated to the disk problem, there have been talks about increasing the
> > available memory from 8 GB in order to be able to run more VMs (the most
> > likely use I can see for additional VMs would be buildbot slaves).
> 
> I might be wrong but running a resource-intense process like building
> Haiku on a machine that we also use to serve responsive websites just
> sounds like a bad idea.

Well, baron really is idle most of the time and we can start the buildbot 
VMs with high nice levels (for both CPU and I/O), so I doubt that this will 
have much impact on the important services. The prerequisite for that 
argument is enough memory, of course.
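
To illustrate what "high nice levels for both CPU and I/O" could look like, here is a small Python sketch that wraps a command with `nice` and `ionice`. The VM command line is purely hypothetical (the thread does not say which virtualization software is used), and whether the idle I/O class is honoured depends on the I/O scheduler in use.

```python
def low_priority(cmd):
    """Prefix a command so it runs with the lowest CPU and I/O priority."""
    # nice -n 19  : lowest CPU scheduling priority
    # ionice -c 3 : "idle" I/O class, served only when the disk is otherwise idle
    return ["nice", "-n", "19", "ionice", "-c", "3", *cmd]


# Hypothetical buildbot VM command, for illustration only.
vm_cmd = ["qemu-system-x86_64", "-m", "2048", "-hda", "buildslave.img"]
print(" ".join(low_priority(vm_cmd)))
```

The resulting list could be handed to `subprocess.run()` as is. Child processes inherit both priorities, so everything the VM does would stay behind the web services.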

> > On top of that, Hetzner (our hosting provider) has recently decided to let
> > every customer pay 1EUR per month for each additional ipv4-address that
> > they are using. We are currently using 3 additional IPs, so the monthly
> > increase will be 3EUR for us, starting in March 2013.
> >
> > The question is how to address the disk problems, while at the same time
> > taking into account those other topics mentioned above. I think these are
> > our options:
> >
> > 0. Don't do anything until one of the disks is thrown out of the RAID1.
> > When that happens, ask Hetzner to replace that disk (and hope that nothing
> > bad will happen in between).
> 
> I would be inclined to pick this option, depending on whether or not
> there are signs that our disks are going bad quickly.

In any case, I'm going to watch the SMART messages closely during the coming 
week ;-)

> > 1. Ask Hetzner to replace both of our disks (one after the other) and do
> > nothing about the memory for now. Costs per month would stay at 49 EUR,
> > starting March 2013 it would be 52 EUR.
> >
> > 2. Let Hetzner replace the disks and ask for an increase of memory to 24
> > GB. The memory increase would incur a setup fee of 49 EUR and would mean
> > that our server (which is an EQ4) would be charged as an EX5, i.e. we'd
> > have to pay 62 EUR/month.
> 
> Why would we upgrade the memory though? Are we experiencing
> performance issues or do we want to deploy new services?

No, performance is fine for now. It's just that in order to potentially 
deploy new services (buildbot slaves, gerrit [most likely as part of vmdev], 
...) we'd need more memory. As there are no immediate plans for new services, 
there's no need to get more memory *now*. It's just that the disk problems 
prompted me to ask for a decision and I wanted to put forward all the 
potential future hosting requirements that I'm aware of.

> I am unsure. I am all for the health of our systems, but I do not
> really see the need to upgrade. If the consensus is that we need a new
> machine though, let's go for a cost-efficient option.

It would certainly help to make up our minds about what services we'd want to 
run on baron and to estimate what the requirements of these services are. If 
it's just buildbots, I'm pretty sure that 16 GB of memory would be good 
enough. 
BTW: should we find out that we need the 32 GB at a later time, it's always 
possible to upgrade an EX4 to an EX4s when we need it (for a setup fee of 49 
EUR plus the increased monthly costs of 62 EUR instead of 52 EUR).
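
To put numbers on that, here is a rough Python sketch comparing the two price points over one year. It assumes that both monthly figures (52 EUR and 62 EUR) already include the 3 EUR for the additional IP addresses, as the comparison above suggests. The 12-month horizon is an arbitrary choice.

```python
def total_cost(months, monthly_eur, setup_eur=0):
    """Total cost in EUR over a number of months, including a one-off fee."""
    return setup_eur + months * monthly_eur


MONTHS = 12  # arbitrary comparison period

keep_memory = total_cost(MONTHS, 52)               # no memory upgrade
more_memory = total_cost(MONTHS, 62, setup_eur=49)  # upgrade, 49 EUR setup fee

print(keep_memory, more_memory, more_memory - keep_memory)  # 624 793 169
```

Under these assumptions the memory upgrade would cost 169 EUR more in the first year and 120 EUR more in every following year.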

cheers,
        Oliver
