[ceph-users] SSD recommendations for OSD journals

Chen, Xiaoxi xiaoxi.chen at intel.com
Mon Jul 22 08:02:50 PDT 2013


Hi,
       My $0.02:

> Secondly, I'm unclear about how OSDs use the journal. It appears they write to the journal (in all cases, can't be turned off), ack to the client and then read the journal later to write to backing storage. Is that correct?


No. The journal is never read except during recovery, when it is replayed.
There are two relevant settings, 'filestore journal parallel' and 'filestore journal writeahead'.
With "journal parallel", the data is written to the journal and the OSD in parallel, and Ceph acks to the client as soon as either the journal write or the OSD write finishes. This mode is only for Btrfs, since Btrfs has a built-in mechanism that helps keep the data consistent.
With "journal writeahead", the data is first written to the journal, then acked to the client, and then written to the OSD. Note that the data stays in memory until it has been written to both the OSD and the journal, so the write to the OSD comes directly from memory. This mode suits XFS and ext4.
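
To make the ack ordering of the two modes concrete, here is a toy timing model in Python. It follows only the description in this message, not the Ceph source; the function name, the mode strings and the millisecond latencies are made up for illustration.

```python
def write_events(mode, journal_ms, data_ms):
    """Return (ack_ms, events) for one client write in a toy model.

    Assumption: journal_ms and data_ms are hypothetical completion
    latencies for the journal write and the OSD data write.
    """
    if mode == "writeahead":
        # Journal first, ack once the journal write is done, then the
        # data is written to the OSD from memory (the journal is not read).
        ack_ms = journal_ms
        events = [
            ("journal_done", journal_ms),
            ("ack", ack_ms),
            ("data_done", journal_ms + data_ms),
        ]
    elif mode == "parallel":
        # Both writes start together; ack when the first one finishes.
        ack_ms = min(journal_ms, data_ms)
        events = sorted(
            [("journal_done", journal_ms), ("data_done", data_ms)],
            key=lambda event: event[1],
        )
        events.insert(1, ("ack", ack_ms))
    else:
        raise ValueError("mode must be 'writeahead' or 'parallel'")
    return ack_ms, events


if __name__ == "__main__":
    for mode in ("writeahead", "parallel"):
        print(mode, write_events(mode, journal_ms=5, data_ms=3))
```

In this model the client-visible latency in writeahead mode depends only on the journal device, which is why the journal SSD matters so much on XFS and ext4.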
Here, "write to journal" means the data is physically written to the journal device. That is not the case for "write to OSD": Ceph opens the file on the OSD without O_DIRECT, so the write goes to the page cache (kernel cache).
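
The difference between a write that must reach the device and one that may sit in the page cache can be sketched with plain POSIX calls from Python. This is only an illustration: O_SYNC stands in for whatever flags the journal code really uses, the function names are hypothetical, and the example assumes a Unix-like system.

```python
import os


def journal_style_write(path, payload):
    """Synchronous write: with O_SYNC, os.write() returns only after the
    kernel reports the data as written to stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o600)
    try:
        os.write(fd, payload)
    finally:
        os.close(fd)


def data_style_write(path, payload):
    """Ordinary buffered write (no O_DIRECT, no O_SYNC): once the file is
    closed the data is in the page cache, and the kernel writes it back
    to the disk later."""
    with open(path, "wb") as f:
        f.write(payload)


def flush_to_disk(path):
    """Force the cached data of an existing file out to the device."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

Both writes leave the same bytes in the file; they differ in when the call returns relative to the data reaching the device.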

> On a similar note, I am using XFS on the OSDs which also journals, does this affect performance in any way?
Again, no. The XFS journal only covers file system metadata; it never journals data extents, so you cannot rely on the XFS journal.

> Can you share any information on the SSD you are using, is it PCIe connected?
It depends. If you use HDDs as your OSD data disks, a SATA/SAS SSD is enough. Instead of the Intel 520, I would suggest the Intel DC S3700, since it provides better write endurance. A DC S3700 can sustain 400~500 MB/s of writes and an HDD only ~100 MB/s, so one DC S3700 can safely serve as the journal for 4~5 HDDs.
If you have some insight into, or assumptions about, your workload, say "I don't care about throughput at all; my workload is all random access", you can use a much higher HDD-to-SSD ratio: 8:1 or even 10:1 will also be fine.
But if you want to use SSDs as data disks, you may need something really fast to journal them. A high-end PCIe SSD or NVRAM may be the choice.
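
The 4~5 HDDs per SSD estimate above is just a throughput ratio. A minimal Python sketch of that arithmetic, using the rough MB/s figures quoted in this message (the function name is made up, and real drives will vary):

```python
def hdds_per_journal_ssd(ssd_write_mb_s, hdd_write_mb_s):
    """How many HDDs one journal SSD can keep up with if every HDD
    writes sequentially at full speed (throughput-bound estimate)."""
    if ssd_write_mb_s <= 0 or hdd_write_mb_s <= 0:
        raise ValueError("throughput values must be positive")
    return int(ssd_write_mb_s // hdd_write_mb_s)


if __name__ == "__main__":
    # Figures from the message: SSD 400~500 MB/s, HDD ~100 MB/s.
    for ssd_mb_s in (400, 500):
        print(ssd_mb_s, "MB/s SSD ->", hdds_per_journal_ssd(ssd_mb_s, 100), "HDDs")
```

This bound only applies when the workload is throughput-bound. For a random-access workload the HDDs deliver far less than ~100 MB/s each, which is why a ratio of 8:1 or 10:1 can still work.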

Xiaoxi


From: ceph-users-bounces at lists.ceph.com On Behalf Of Charles 'Boyo
Sent: Monday, July 22, 2013 5:04 AM
To: Mikaël Cluseau
Cc: ceph-users at lists.ceph.com
Subject: Re: [ceph-users] SSD recommendations for OSD journals

Thank you for the information Mikael.
Counting on the kernel's cache, it appears I will be best served purchasing write-optimized SSDs?
Can you share any information on the SSD you are using, is it PCIe connected?
Another question, since the intention of this storage cluster is relatively cheap storage on commodity hardware, what's the balance between cheap SSDs and reliability since journal failure might result in data loss or will such an event just 'down' the affected OSDs? On a similar note, I am using XFS on the OSDs which also journals, does this affect performance in any way?

Charles

On Sun, Jul 21, 2013 at 9:27 PM, Mikaël Cluseau <mcluseau at isi.nc> wrote:
Hi,


On 07/22/13 06:05, Charles 'Boyo wrote:

Secondly, I'm unclear about how OSDs use the journal. It appears they write to the journal (in all cases, can't be turned off), ack to the client and then read the journal later to write to backing storage. Is that correct?

Yes



I'm coming from enterprise ZFS, where an SSD is also used for write journalling but data flushes come from the disk cache in memory, hence the use of write-optimized SSDs. Why can't Ceph be configured to write from RAM instead of reading the journal on flush?

From my stats I can tell that the journal flushes use the kernel's cache and do not hit the SSD. Here, sdd is my journal SSD:


