Have you read a lot about fast SSDs that do not perform well in a ceph cluster?
About SSDs that die quickly in a ceph cluster?
Should you prefer HDDs over SSDs?
Do you have no clue how to tell what is true and what is a myth?
To understand this article fully, you should also read these articles:
- What disk or drive types to use with ceph?
- How ceph is handling atomic persisting of IOs/objects
- Ceph storage devices
Available tools under Linux for our tests
Under Linux we are lucky to have a bunch of tools that help us test our drives in depth. Well-known tools are
- fio: flexible I/O tester
- dd: disk dump
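If you are not sure whether these tools are present on your system, a quick check like the following should tell you (the install command is just an example for Debian/Ubuntu based systems):
which fio || sudo apt-get install fio
fio --version
dd --version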
Controller-dependent impacts on your tests
Linux uses disk caches, and so do hardware controllers. These caches falsify the tests, because we would measure the speed of the RAM instead of the speed of the drive.
Disable the Linux cache
Linux uses a cache for all drives, so first we need to disable the Linux write cache on our drive:
sudo hdparm -W 0 /dev/hda
In the above example we disable the write cache for the drive /dev/hda.
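If you want to verify that the write cache is really off, running hdparm with -W and no value prints the current setting (replace /dev/hda with your device):
sudo hdparm -W /dev/hda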
Disable the HP controller cache
HP controllers typically provide RAID array management, which itself uses controller-based caches to optimize write and read performance. We need to disable these caches as well!
sudo hpacucli ctrl slot=2 modify dwc=disable
sudo hpacucli controller slot=2 logicaldrive 1 modify arrayaccelerator=disable
The above commands assume that your HP controller is in slot 2 and your logical drive is number 1.
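To double-check the controller and logical drive settings before running a benchmark, a command along the following lines should print the current cache configuration (slot 2 is again only an assumption about your setup):
sudo hpacucli ctrl slot=2 show config detail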
Using fio
For the more thorough tests we use fio. This is the sample command for our tests (a filled-in example follows after the option list below):
sudo fio --filename=/dev/<yourDevice> --direct=1 --sync=1 --rw=write --bs=4k --numjobs=<numOfConcurrentThreads> --iodepth=1 --runtime=600 --time_based --group_reporting --name=journal-test
Let us dive into the command options:
- --filename: the device we want to test, e.g. /dev/sda or /dev/nvme0n1
- --direct: here we tell fio to work with O_DIRECT (for details see Performance considerations for journal / metadata disks)
- --sync: here we tell fio to work with O_DSYNC (for details see Performance considerations for journal / metadata disks)
- --rw: the IO pattern to use. In our case we use write for sequential writes, as journal/metadata writes are always sequential (for details see Performance considerations for journal / metadata disks)
- --bs: the block size. With the block size we simulate the object size handed over from a client to our OSD. In our example we are submitting 4K objects. 4k is probably a worst-case scenario; if you know your workload in terms of block size, you can adjust it to your needs
- --numjobs: the number of threads that will be running. With this we simulate the number of concurrent client accesses to our ceph OSD, think of several ceph-osd daemons writing to the journal
- --iodepth: we are submitting the IOs one by one
- --runtime: the job duration in seconds. You should test different durations to ensure that all optimizations of the HDD/SSD vendor are overridden and you get down to the real performance values of your drive. These drive optimizations are typically a tiering of different storage types (memory, caches and, on SSDs, faster cells in front of the cells used for the final data storage); such tiers help to increase the overall performance
- --time_based: if you have a fast drive, the test could finish before reaching the runtime limit. This option ensures that the test runs for the specified runtime; basically it reruns the operation over and over again until the total runtime is reached
- --group_reporting: tells fio to report one overall value. With multiple threads (numjobs) fio would usually report each job independently, which does not help us here, so with this parameter we group the results into one overall result
- --name: the name of this run/test
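As a filled-in example, a single-threaded 60-second journal-style test against the hypothetical device /dev/sdb would look like this (adapt the device and the values to your system):
sudo fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
At the end of the run fio reports IOPS and bandwidth; the write IOPS value is the number we care about for journal/metadata performance.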
IMPORTANT NOTICE
The above command will overwrite the data on the drive! Be careful to pick the correct device, otherwise your operating system or even your data will be overwritten!
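A simple way to double-check which device you are about to test is to list the block devices with their size, model and mount points first; anything with a mount point is usually not the device you want to overwrite:
lsblk -o NAME,SIZE,MODEL,MOUNTPOINT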
Tests to saturate your fastest drives
Your disks can be so fast that you need multiple tests to max out (saturate) them. We can try this with the runtime or with the numjobs parameter.
Test via long-running tests
As mentioned above, you need to test different runtimes with the --runtime parameter to get past the caches and drive tunings. Start with 60 seconds and increase it stepwise to 120, 180 and so on. My tests often look like this (a loop to automate this series is sketched below):
- 60: first test
- 180: second test
- 600: third test
If the 600-second test shows nearly the same values as the previous tests, you have either already reached saturation or you cannot reach it at all.
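If you want to automate the runtime series, a small shell loop like the one below will do; /dev/sdb is again only a placeholder for your test device:
for rt in 60 180 600; do
  sudo fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=$rt --time_based --group_reporting \
    --name=journal-test-${rt}s
done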
Test via parallel running jobs
If you have really fast drives, the tests named above will not saturate your drive. To saturate it we have to run more parallel threads against the drive with the --numjobs parameter. My tactic here is to start with one thread and increase it in steps of one (see the sketch after this list):
- --numjobs=1
- --numjobs=2
- --numjobs=3
- --numjobs=4
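The same idea can be scripted; the sketch below runs the test with one to four parallel jobs against the hypothetical device /dev/sdb:
for jobs in 1 2 3 4; do
  sudo fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=$jobs --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=journal-test-${jobs}jobs
done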
Using dd
Sometimes you have neither fio on your system nor the option to install it. In this case you can also use dd. This is the sample command for our dd tests. First we generate a random test file and make sure it is synced to disk:
sudo dd if=/dev/urandom of=randomtestfile bs=1M count=1024
sudo sync
Now let's run the test against our drive:
sudo dd if=randomtestfile of=/dev/<yourdrive> bs=4k count=100000 oflag=direct,dsync
Let us dive into the command options:
- if: the input file, which is either the random device or our test file
- of: the output file, which is either our test file or our drive
- bs: the block size. With the block size we simulate the object size handed over from a client to our OSD. In our example we are submitting 4K objects. 4k is probably a worst-case scenario; if you know your workload in terms of block size, you can adjust it to your needs
- count: here we tell dd how many blocks to process
- oflag: here we set the IO flags, i.e. O_DIRECT and O_DSYNC (for details see Performance considerations for journal / metadata disks)
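dd prints the number of bytes copied, the elapsed time and the throughput when it finishes. To get an IOPS figure comparable to the fio results, divide the block count by the elapsed time; the numbers below are purely illustrative, not a measurement:
# hypothetical run: 100000 blocks of 4k written in 200 seconds
# IOPS       = 100000 / 200 = 500
# throughput = 500 * 4 KiB  = roughly 2 MB/s
echo $((100000 / 200))   # prints 500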
Community-driven performance measurements
You can also submit your own measurements here to help other users reduce their work, just as you have profited from the work of others. Please share your details: it costs you only a minute and may save others hours or days of needless analysis.