RAID Level and Stripe Size

aaronearles

Okay, so I'm putting together a whitebox for ESXi. I have a PERC 5/i and 6x 1TB 7200 RPM SATA drives, and the plan was to put them in RAID10 because I'm not really concerned about capacity, just decent performance. I don't have a specific application in mind - it will mostly be used for Windows labs and testing software, nothing intensive - but I'd still like to configure it for best performance. I also plan to house my fileserver there, just as a crap datastore, but I would like it to be redundant.

So, since I'm leaning more toward performance than capacity, I was thinking RAID10 because of the small writes each OS will be making to the array - figuring parity calculations will slow things down with lots of small writes. Is the performance difference significant enough to give up 2TB on the array, or would you recommend RAID5 instead?

Also, any suggestion on stripe size? OS writes are going to be mostly small stuff, but the vmdks are going to be quite large - I'm not sure if that matters...

Thanks guys! :D
 
With RAID10, you're not giving up 2TB, you're giving up 3, btw (the way you typed that makes me assume you're putting three 2-disk RAID1 mirrors in a RAID0). But if you're not concerned about capacity, that's a sensible configuration to run - 3 drives' worth of striping is plenty fast. If you want to keep as much capacity as possible and still get good performance, go for RAID5 (it's like RAID0 with a parity drive - not quite as fast, though it might still be faster than RAID10; I'm not sure, I haven't tested it myself).

As for stripe size, I don't have a comment. I almost always just run the default stripe size because my workloads vary so much.
 
Thanks for the response.

I was referring to giving up 2TB when comparing to RAID5, not JBOD.

RAID5 = 1 disk capacity lost
RAID10 = 3 disk capacity lost

From my research, RAID10 should be faster for writes and better in general for OS and guest use, because of the read-modify-write overhead of updating parity for every small write an OS makes.
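If anyone wants to sanity-check the math, here's a quick Python sketch - the ~75 IOPS per 7200rpm SATA disk figure and the textbook write penalties (4 back-end IOs per small write for RAID5, 2 for RAID10) are my own assumptions, not measured numbers:

```python
# Back-of-envelope comparison for 6x 1TB 7200rpm SATA on a PERC 5/i.
# Assumed figures: ~75 random IOPS per disk, textbook small-write penalties
# (RAID5: read data + read parity + write data + write parity = 4 IOs,
#  RAID10: 2 mirrored IOs). Real numbers depend on cache, queue depth, etc.

DISKS = 6
DISK_TB = 1.0
DISK_IOPS = 75          # assumption for a 7200rpm SATA drive

layouts = {
    # name: (usable capacity in TB, write penalty)
    "RAID5":  ((DISKS - 1) * DISK_TB, 4),
    "RAID10": ((DISKS // 2) * DISK_TB, 2),
}

raw_iops = DISKS * DISK_IOPS

for name, (usable_tb, penalty) in layouts.items():
    # Worst case, 100% random writes: every host write costs `penalty` back-end IOs
    write_iops = raw_iops / penalty
    print(f"{name}: ~{usable_tb:.0f} TB usable, "
          f"~{raw_iops} back-end IOPS, ~{write_iops:.0f} front-end write IOPS")
```

On those assumptions RAID10 gives up 2TB versus RAID5 but roughly doubles the random-write ceiling.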

However, I ended up going RAID5 because my bottleneck seems to be the PERC 5 itself - no matter what RAID config I've benchmarked, I can't seem to exceed 150MB/s sequential. ESX also has a 2TB LUN limitation, and the PERC will let you carve multiple 2TB virtual disks out of simple RAID levels like RAID5, but it won't do that for spanned configurations like RAID10.

So anyway, I went RAID5 and carved out two 2TB VDs and a ~800GB VD.
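For anyone carving up a similar array, the split is just the usable capacity chopped into chunks that stay under the ESX per-LUN ceiling (2TB minus 512 bytes for VMFS-3). A rough sketch, using round decimal TB rather than what the controller actually reports:

```python
# Chopping a ~5 TB RAID5 array (6x 1TB, decimal TB, overhead ignored) into
# virtual disks that each stay at or under the ESX/VMFS-3 per-LUN ceiling.

usable_tb = 5.0        # 6 drives minus one for parity
lun_limit_tb = 2.0     # 2 TB (minus 512 bytes) per LUN in ESX 3.x/4.x

vds = []
remaining = usable_tb
while remaining > 0:
    size = min(lun_limit_tb, remaining)
    vds.append(size)
    remaining -= size

print("Virtual disks:", ", ".join(f"{s:.1f} TB" for s in vds))
# -> Virtual disks: 2.0 TB, 2.0 TB, 1.0 TB
#    (the exact size of the last VD depends on how the controller rounds
#     the real drive capacities and on metadata overhead)
```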
 
I'd be interested to see what lopoetve has to say about your stripe size question. My thinking would be that you'd want the stripe size of the controller and the block size of the filesystem to match, so barring any other requirements (maximum file size being the main consideration in choosing the VMFS block size), matching 1MB to 1MB - or, if a PERC5 is capable of it (I know the PERC6 is), 2MB to 2MB - would be preferable?

Where sub-blocks come in (1/8th the size of the VMFS block size) and how operations on them affect performance is still a question in my mind. I know they'd be a factor in thin provisioning, where used capacity in a volume is under tighter scrutiny, but would they hurt performance unduly if the wrong stripe size was chosen?
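For reference, here's the VMFS-3 block size to maximum file size mapping as I understand it - worth double-checking against VMware's documentation before relying on it:

```python
# VMFS-3 block size vs. the largest single file (i.e. largest VMDK) it allows,
# as I understand it. Sub-blocks are a separate mechanism for small files, and
# I haven't seen anything tying them to controller stripe size.

vmfs3_block_to_max_file = {
    "1 MB": "256 GB",
    "2 MB": "512 GB",
    "4 MB": "1 TB",
    "8 MB": "2 TB (minus 512 bytes)",
}

for block, max_file in vmfs3_block_to_max_file.items():
    print(f"block size {block:>4} -> max file size {max_file}")
```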
 
However, I ended up going RAID5 because my bottleneck seems to be the PERC 5 itself - no matter what RAID config I've benchmarked, I can't seem to exceed 150MB/s sequential.

Sequential throughput doesn't mean shit.
You are going to be IO bound, so if one RAID level gives you higher IOPS, then I would do that.

Again, RAID10 will have better IOPS by design, but you may still be limited by your PERC 5/i.
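Rough math - the ~75 IOPS per disk and the 70/30 read/write mix are just guesses, and the textbook penalties ignore cache:

```python
# Rough front-end IOPS for 6 disks under a mixed workload.
# Assumptions: ~75 random IOPS per 7200rpm SATA disk, 70% reads / 30% writes,
# textbook write penalties (RAID10 = 2, RAID5 = 4). Controller cache will
# change these numbers in practice.

disks, disk_iops = 6, 75
read_frac, write_frac = 0.7, 0.3
backend = disks * disk_iops

for level, penalty in (("RAID10", 2), ("RAID5", 4)):
    # Each front-end read costs 1 back-end IO, each front-end write costs `penalty`.
    frontend = backend / (read_frac * 1 + write_frac * penalty)
    print(f"{level}: ~{frontend:.0f} front-end IOPS at a 70/30 read/write mix")
```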
 
With RAID10, you're not giving up 2TB, you're giving up 3, btw (the way you typed that makes me assume you're putting three 2-disk RAID1 mirrors in a RAID0). But if you're not concerned about capacity, that's a sensible configuration to run - 3 drives' worth of striping is plenty fast. If you want to keep as much capacity as possible and still get good performance, go for RAID5 (it's like RAID0 with a parity drive - not quite as fast, though it might still be faster than RAID10; I'm not sure, I haven't tested it myself).

As for stripe size, I don't have a comment. I almost always just run the default stripe size because my workloads vary so much.

RAID5 has a significantly higher back-end IOPS hit than RAID1/10/0+1. I agree with the rest :)

Thanks for the response.

I was referring to giving up 2TB when comparing to RAID5, not JBOD.

RAID5 = 1 disk capacity lost
RAID10 = 3 disk capacity lost

From my research, RAID10 should be faster for writes and better in general for OS and guest use, because of the read-modify-write overhead of updating parity for every small write an OS makes.

However, I ended up going RAID5 because my bottleneck seems to be the PERC 5 itself - no matter what RAID config I've benchmarked, I can't seem to exceed 150MB/s sequential. ESX also has a 2TB LUN limitation, and the PERC will let you carve multiple 2TB virtual disks out of simple RAID levels like RAID5, but it won't do that for spanned configurations like RAID10.

So anyway, I went RAID5 and carved out two 2TB VDs and a ~800GB VD.

Not surprised - only so much those cards can handle. Writes are where RAID5 takes the hit (5 IOs burned for every IO submitted to the card, effectively).

I'd be interested to see what lopoetve has to say about your stripe size question. My thinking would be that you'd want the stripe size of the controller and the block size of the filesystem to match, so barring any other requirements (maximum file size being the main consideration in choosing the VMFS block size), matching 1MB to 1MB - or, if a PERC5 is capable of it (I know the PERC6 is), 2MB to 2MB - would be preferable?

Where sub-blocks come in (1/8th the size of the VMFS block size) and how operations on them affect performance is still a question in my mind. I know they'd be a factor in thin provisioning, where used capacity in a volume is under tighter scrutiny, but would they hurt performance unduly if the wrong stripe size was chosen?

Well, there are really two parts there... VMFS block size is a bit different. Most places run a 64k stripe, IIRC. Some run 32k. NetApp uses 4k by default :)

I'd stick with the default, or test against your workload. Hard to say from this high a level.
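If you do want to test against your own workload, even something this crude will show the difference between configs (the path, file size and IO count below are placeholders; a proper tool like Iometer is still the better answer):

```python
# Crude random 4 KB write test against a file on the volume you're testing.
# PATH, FILE_SIZE and IO_COUNT are placeholders - adjust for your setup.
# fsync after every write keeps the page cache from hiding the array's
# behavior, but this is nowhere near as rigorous as a real benchmark tool.

import os, random, time

PATH = "/vmfs/volumes/datastore1/iotest.bin"   # placeholder path
FILE_SIZE = 1 * 1024**3                         # 1 GiB test file
IO_SIZE = 4096                                  # 4 KB writes
IO_COUNT = 2000

fd = os.open(PATH, os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, FILE_SIZE)
buf = os.urandom(IO_SIZE)

start = time.time()
for _ in range(IO_COUNT):
    # Pick a random 4 KB-aligned offset, write, and force it to disk.
    offset = random.randrange(0, FILE_SIZE // IO_SIZE) * IO_SIZE
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, buf)
    os.fsync(fd)
elapsed = time.time() - start
os.close(fd)

print(f"~{IO_COUNT / elapsed:.0f} fsync'd 4K random write IOPS")
```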
 
Were you able to get ESXi working with a Perc 5i?
The only way I am able to get a stable datastore is to turn VT-d off. I would love to hear your experiences.
 
what's the firmware like on your perc card, evandena?

And do you have the battery backed cache module that is GOOD?
 
Worked fine on my Perc 5/i. I have mine packed up to move right now so I can't check it, but I had an older Dell firmware and no BBU. I planned to flash to the LSI firmware once I get it set up again.

Did you cover the SMBus contact on the card? I had a lot of stability issues until I covered mine.

[edit] That's assuming this is in a PC and not a PowerEdge...
 
what's the firmware like on your perc card, evandena?

And do you have the battery backed cache module that is GOOD?

I do not have a battery backup, and I have force write back enabled.
I have tried the previous two Dell firmwares and am currently on the LSI one from November '09.


Worked fine on my Perc 5/i. I have mine packed up to move right now so I can't check it, but I had an older Dell firmware and no BBU. I planned to flash to the LSI firmware once I get it set up again.

Did you cover the SMBus contact on the card? I had a lot of stability issues until I covered mine.

[edit] That's assuming this is in a PC and not a PowerEdge...

It's in a PowerEdge T110.


I would go ballistic if I could get this stupid thing to work with VT-d.

I am unable to create or keep a stable datastore using ESXi or ESX, 3.5 or 4.0, installable or embedded. Each gives similar errors.

If I disable VT-d in the BIOS, everything works. But then I am not able to use 64-bit guests, which is kind of a bummer for a home test lab.

XenServer seems to work fine, but no memory overcommit in the free version and limited paravirtualization are kind of a deal breaker.

I've researched the heck out of this problem, and a lot of people are seeing this, with different hardware setups. All with Perc 5i though. I can't imagine every Perc has this problem, and I can't imagine that only a few people have ESX installed with a Perc.

Anyone have any ideas? I've been working on this since October and I'm starting to go insane.
 
ah. Get the BBWC unit. ESX4 is more sensitive to that - we see it all the time :)

otherwise, it'll also be slow as shit.
 
ah. Get the BBWC unit. ESX4 is more sensitive to that - we see it all the time :)

otherwise, it'll also be slow as shit.

Or maybe I have force write back disabled... can't remember which. It's supposed to ignore the missing battery and write back anyway. I really don't care if a few bits of my home lab freak out on a power outage... I just want 64-bit ESX. As it stands, I can get pretty good speeds on a 3-disk RAID5, around 130MB/s through CIFS.

So you're saying a battery might help with the dang datastore issues? If so, I'll order one right now!
 
I'm saying it certainly might ;)

HP servers are the ones most impacted by it - the PX00 controllers really hate not having the battery, and performance is bad enough that it triggers an attempted failover... which fails, since there's no path to fail over TO. HP has a tech advisory out about it.

I've seen it on the PERC cards, but it's rarer. Would be worth a shot, and shouldn't be too much $$ either.
 
I'm having the exact same issue as Evandena with a Dell Perc 5/i.

Finding it impossible to get stable RAID arrays on the Perc 5 with vSphere or ESX 3.5.

I have a backup battery with write cache enabled, the battery is confirmed as working, and there are no faults on the Perc.

It just seems that with ESX 3.5 or vSphere, this card with VT-d enabled = data corruption.

My settings are as follows:

4 x 320GB SATA 2 Seagates
Raid 10
128k block size
Write back enabled
Read Ahead enabled

I also have the BBU installed, confirmed working and charged; further to that, I've also confirmed the 256MB DDR module is error free.

I currently have the latest Dell firmware installed, and have also tried the LSI .74 and .75 firmwares (the two most recent from LSI). The motherboard is a P5K-E WiFi with the latest BIOS; I've also tried the two previous versions.

I have no idea what further troubleshooting I can do; it seems this issue is widespread among people with these cards.

Anyone got any suggestions?
 
Well, you ARE using SATA drives.

SATA does not have an ECC cache. It cannot correct for errors. Technically, SATA RAID is not supported unless it's on an array that can do such ECC correction. I haven't seen problems with the PERC cards and most SATA drives so far, but I'll do some digging once I'm back from vacation and let you all know.
 
Put tape over the SMBus connector on the Perc.

Have you tried this, ikioi? My initial research showed that this was only a solution on older Intel chipsets, where the problem was getting the card to boot at all. I think we're beyond that, and seem to have a problem unique to Perc + ESX + VT-d.
 
Yeah, it was originally a workaround for older chipsets.

I have not tried it with ESX.

But you have nothing to lose, and it might work. Just throwing out ideas.
 
I had to cover the SMBus pins on mine to get it to remain stable - it would boot fine but drop out constantly. The chipset is an X58.
 
It's a known Nvidia/Intel chipset issue with the SMBus controllers and Perc cards.
 
But only in ESX. It's hard to say it's a hardware problem when it works with every other OS.
 
That's not quite what I'm saying ;)

The array is unstable because something with the card is not working correctly. It may be that it was designed only to work with a specific BIOS (on a Dell server), or something with the card itself, or something with the LSI firmware.

Hard to say. I'd need to see logs to know for sure.
 