[ceph-users] Cephfs Hadoop Plugin and CEPH integration

Gregory Farnum gfarnum at redhat.com
Mon Nov 27 11:46:58 PST 2017

On Mon, Nov 27, 2017 at 12:55 PM Aristeu Gil Alves Jr <aristeu.jr at gmail.com> wrote:

> Hi.
> This is my first post to the list. First of all, I have to say I'm new to
> Hadoop.
> We are a small lab and have been running CephFS for almost two
> years, loading it with large files (4GB to 4TB in size). Our cluster is
> approximately 400TB at ~75% usage, and we are planning to
> grow a lot.
> Until now, we have processed most of the files the "serial reading" way. But
> now we will try to implement parallel processing on these files, and we are
> looking at the Hadoop plugin as a solution for using MapReduce, or
> something like that.
> Does the Hadoop plugin access CephFS over the network as a normal cluster,
> or can I install Hadoop's processors on every Ceph node and process the
> data locally?
The Hadoop plugin does both:
1) accesses CephFS over the network as a normal client+cluster,
2) is fully integrated with the data-layout APIs, so if you install Hadoop
on the Ceph nodes it will generally schedule work on the primary OSD for
the data chunk in question.
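
To illustrate point 2, here is a toy sketch of locality-aware scheduling. It is not Hadoop's actual scheduler or the plugin's code; the function name `pick_worker` and the host names are hypothetical, and the only assumption is that the host holding the primary copy is listed first.

```python
from typing import List, Optional


def pick_worker(block_hosts: List[str], idle_workers: List[str]) -> Optional[str]:
    """Toy model: prefer an idle worker on the host that holds the primary copy.

    block_hosts  -- hosts storing the chunk, primary first (assumed ordering)
    idle_workers -- hosts that currently have a free task slot
    """
    if not idle_workers:
        return None  # nothing can run right now
    primary = block_hosts[0] if block_hosts else None
    if primary in idle_workers:
        return primary  # local read, no network traffic for this chunk
    return idle_workers[0]  # fall back to a remote read over the network


if __name__ == "__main__":
    print(pick_worker(["node3"], ["node1", "node3"]))
    print(pick_worker(["node3"], ["node1", "node2"]))
```

If Hadoop is not installed on the Ceph nodes, every task takes the fallback path and reads over the network, which is case 1.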

So, it works almost the same in terms of data and network as HDFS does.
(HDFS will usually do a local write for one of its copies; CephFS+Hadoop
doesn't do that.) A few caveats though:
1) the plugin is not maintained very well. It was updated a few years ago
for the Hadoop 2.x API changes, and I've seen a few PRs from users go by
updating minor things, so it should still be good. But there's no
proactive work on it from the core upstream development teams.
2) Data you've currently got stored in CephFS is probably in 4MB chunks, as
that's the default. When using Hadoop we default to 64MB for new data.
Hadoop is unlikely to want to schedule a different job for each 4MB piece
of data, so you will probably get more network traffic on your existing
data than you'd otherwise expect.
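
A rough back-of-the-envelope sketch of caveat 2 follows. It assumes that one Hadoop task covers 64MB of a file, that objects are spread uniformly and independently across hosts (a simplification of CRUSH placement), and that the task runs on the host holding the first object. The function names are hypothetical. The `ceph.file.layout.object_size` virtual xattr is how CephFS exposes a file's object size; reading it via `os.getxattr` only works on Linux against a mounted CephFS.

```python
import os

MiB = 1024 * 1024


def cephfs_object_size(path: str) -> int:
    """Read a file's object size from the CephFS layout xattr (Linux only)."""
    return int(os.getxattr(path, "ceph.file.layout.object_size").decode())


def objects_per_split(split_size: int, object_size: int) -> int:
    """Number of objects one task has to read (ceiling division)."""
    return -(-split_size // object_size)


def expected_local_fraction(split_size: int, object_size: int, num_hosts: int) -> float:
    """Expected share of a split read locally under the stated assumptions.

    The first object is local by construction; each remaining object is
    local with probability 1/num_hosts (uniform placement assumption).
    """
    n = objects_per_split(split_size, object_size)
    return (1 + (n - 1) / num_hosts) / n


if __name__ == "__main__":
    for obj in (4 * MiB, 64 * MiB):
        n = objects_per_split(64 * MiB, obj)
        frac = expected_local_fraction(64 * MiB, obj, num_hosts=10)
        print(f"object size {obj // MiB}MB: {n} objects per task, "
              f"~{frac:.0%} expected local")
```

Under these assumptions, a task over existing 4MB-object data touches 16 objects and reads most of them remotely, while data written with 64MB objects can be read entirely from the local node.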