[ceph-users] Cephfs Hadoop Plugin and CEPH integration

Orit Wasserman owasserm at redhat.com
Wed Nov 29 09:38:13 PST 2017

On Wed, Nov 29, 2017 at 6:54 PM, Gregory Farnum <gfarnum at redhat.com> wrote:
> On Wed, Nov 29, 2017 at 8:52 AM Aristeu Gil Alves Jr <aristeu.jr at gmail.com>
> wrote:
>>> > Do s3 or swifta (for Hadoop or Spark) have integrated data-layout
>>> > APIs for
>>> > processing data locally, as the CephFS Hadoop plugin has?
>>> >
>>> With s3 and swift you won't have data locality, as they were designed for
>>> the public cloud.
>>> We recommend disabling locality-based scheduling in Hadoop when running
>>> with those connectors.
>>> There is ongoing work to optimize those connectors to work with
>>> object storage.
>>> The Hadoop community is working on the s3a connector.
>>> There is also https://github.com/SparkTC/stocator which is a swift-based
>>> connector IBM wrote for their cloud.
>> In those cases, how would a MapReduce job work without data
>> locality?
>> How do the processors get the data? There is still a need to split the data,
>> no?
>> Doesn't it severely impact performance on big files (not just the
>> network)?
> Given that you already have your data in CephFS (and have been using it
> successfully for two years!), I'd try using its Hadoop plugin and seeing if
> it suits your needs. Trying a less-supported plugin is a lot easier than
> rolling out a new storage stack! :)

Completely agree :)

> -Greg
