How ceph handles atomic persistence of IOs/objects

Have you read about SSDs that do not perform well in a ceph cluster?
And you have no clue how to tell what is true and what is a myth?

To fully understand this article, you should also read both of these articles beforehand.

So how does ceph write and thus persist IO packages to the storage devices?

Now let's take a deep dive into Linux file system operations, which are crucial for understanding how ceph works and performs. Let us assume we have a ceph cluster with a replica count of 3 (3 independent copies of the data).

<image>

  1. an object is sent by a ceph client to the primary OSD
  2. this object results in an input/output operation (IO) package on the OSD
  3. because ceph is designed to persist objects with atomic writes (all or nothing), the OSD first writes the object to a journal (more details below under Atomic writes). This also decouples the blocking write from the ceph client, and the replication can be done in the background.
  4. once the object is safely persisted in the journal, a second operation is triggered to persist the object into the filestore with a buffered (buffered_io), and therefore cached and performance-optimized, write operation
  5. the above steps 1..4 are repeated on the second OSD to ensure the second replica
  6. the above steps 1..4 are repeated on the third OSD to ensure the third replica
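
To make the write path above more tangible, here is a minimal sketch of the per-replica double write in plain C. The function names and printed messages are purely illustrative stand-ins for this article, not ceph code:

    #include <stdio.h>

    /* Illustrative stubs: in a real OSD the first call would be the synchronous
     * journal write (O_DIRECT | O_DSYNC) and the second the buffered filestore write. */
    static void write_journal(int osd)   { printf("OSD %d: journal write (synchronous, 1 IO)\n", osd); }
    static void write_filestore(int osd) { printf("OSD %d: filestore write (buffered, 1 IO)\n", osd); }

    int main(void)
    {
        const int replicas = 3;        /* replica count assumed in this article */

        /* One client object fans out to <replicas> OSDs; each OSD performs
         * at least two write operations: journal first, then filestore. */
        for (int osd = 1; osd <= replicas; osd++) {
            write_journal(osd);        /* persist atomically, then acknowledge */
            write_filestore(osd);      /* apply to the object store afterwards */
        }
        return 0;
    }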

As you can see, steps 1..4 require at least 2 IOPS to write the data once. The second replica again requires at least 2 IOPS, and the third replica also requires at least 2 IOPS. So you can easily derive the total IOPS required in your cluster from your required net IOPS.

total IOPS = (required net IOPS) * (2 IOPS per replica) * (number of replicas)

Let's assume your application requires about 1,000 net IOPS. As a storage admin you want to have 3 replicas. This results in:

1,000 IOPS * 2 * 3 = 6,000 IOPS in total in your ceph cluster, and at least 2,000 IOPS per replica. You have to take this into account when sizing your nodes and distributing your replicas.
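
If you prefer to calculate this, the formula translates into a one-line helper; total_iops is just a hypothetical name used for this sketch:

    #include <stdio.h>

    /* total cluster IOPS = net IOPS * IOPS per replica write * number of replicas */
    static long total_iops(long net_iops, int iops_per_replica_write, int replicas)
    {
        return net_iops * iops_per_replica_write * replicas;
    }

    int main(void)
    {
        /* the example from the text: 1,000 net IOPS, 2 IOPS per replica write, 3 replicas */
        printf("%ld IOPS\n", total_iops(1000, 2, 3));   /* prints: 6000 IOPS */
        return 0;
    }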

Atomic writes

Since we want atomic writes, meaning our object is either written fully or not at all (all or nothing), ceph uses libaio with O_DIRECT and O_DSYNC for the journal writes.

  • O_DIRECT
    ensures that the Linux kernel transfers the data directly onto the device, via DMA (direct memory access) if possible. This avoids copying the data from user space to kernel space and writes it directly, so the object data does not go through the page cache. Unfortunately, this alone gives no strict guarantee that the call returns only after all data has been transferred. Therefore ceph also sets
  • O_DSYNC
    which guarantees that the call does not return before all data has been transferred to the disk. (If the disk itself uses a write cache, the OS can no longer give a full guarantee, but this operation goes as far with the guarantee as possible.)

Considering both factors, and that there is a small chance of issues with a disk's cache, you may now understand why storage admins prefer to avoid caches on their drives and disks. But that is a topic for a different post on this page.
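
To make the two flags more concrete, here is a minimal sketch of a journal-style write. It uses a plain synchronous pwrite() instead of libaio (which ceph actually uses), the file name journal.bin and the 4 KiB alignment are assumptions, and error handling is kept to a minimum:

    #define _GNU_SOURCE                     /* needed for O_DIRECT on Linux/glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t align = 4096;          /* assume a 4 KiB logical block size */
        void *buf = NULL;

        /* O_DIRECT bypasses the page cache (DMA straight to the device if possible),
         * O_DSYNC makes the call return only after the data has reached the device. */
        int fd = open("journal.bin", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires the buffer, offset and length to be block aligned. */
        if (posix_memalign(&buf, align, align) != 0) { close(fd); return 1; }
        memset(buf, 0xAB, align);           /* dummy journal entry payload */

        if (pwrite(fd, buf, align, 0) != (ssize_t)align)
            perror("pwrite");

        free(buf);
        close(fd);
        return 0;
    }

The aligned allocation is the part people usually trip over: with O_DIRECT the kernel rejects writes whose buffer, offset or length is not aligned to the device's logical block size.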

What is a journal, and what is it in ceph specifically?

In IT, journaling is the process by which you track operations and ensure they are carried out. This means:

  • Ensure data consistency for the given transaction. In ceph it acts like a traditional filesystem journal, so the operation can be replayed if something goes wrong (see the sketch after this list).
  • Provide atomic transactions by keeping track of what was committed and what is going to be committed.
  • The journal itself is always written sequentially (to ensure consistency and the possibility of replay).
  • It therefore ensures the order stays the same and works on the first-in/first-out (FIFO) principle.
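
The sequential, FIFO nature of a journal can be illustrated with a tiny append-only log: entries are appended in order and replayed in the same order after a crash. This is not ceph's on-disk journal format, just a sketch of the principle with made-up fields and file names:

    #include <stdint.h>
    #include <stdio.h>

    /* A made-up journal entry: sequence number plus payload, appended in order. */
    struct entry {
        uint64_t seq;          /* monotonically increasing, preserves the FIFO order */
        char     data[56];     /* payload describing the transaction */
    };

    int main(void)
    {
        /* Append phase: entries are written strictly sequentially to the end of the log. */
        FILE *log = fopen("journal.log", "wb");
        if (!log) return 1;
        for (uint64_t seq = 1; seq <= 3; seq++) {
            struct entry e = { .seq = seq };
            snprintf(e.data, sizeof e.data, "object update %llu", (unsigned long long)seq);
            fwrite(&e, sizeof e, 1, log);
        }
        fclose(log);

        /* Replay phase (e.g. after a crash): re-apply the entries first-in/first-out. */
        log = fopen("journal.log", "rb");
        if (!log) return 1;
        struct entry e;
        while (fread(&e, sizeof e, 1, log) == 1)
            printf("replay #%llu: %s\n", (unsigned long long)e.seq, e.data);
        fclose(log);
        return 0;
    }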

Isn’t filestore the obsolete data store in ceph?

Yes it is, but it makes the principles behind ceph much easier to explain. It also makes clearer why ceph requires a good network connection from the ceph clients down to the OSDs. Good internal bandwidth between the components in the servers is also needed to persist the data between the OSDs and the journal, and for each replica, efficiently.

In the meantime, ceph recommends using the so-called BlueStore, which performs basically the same operations as mentioned above. It is a bit more efficient, and a BlueStore device can be anything that is able to provide persistence in ceph.

Which drives to choose for the ceph journal / metadata?

This will be answered in detail in How to test a SSD for compatibility for ceph journal disk?