[ceph-users] Cephfs Hadoop Plugin and CEPH integration

Gregory Farnum gfarnum at redhat.com
Wed Nov 29 08:54:39 PST 2017

On Wed, Nov 29, 2017 at 8:52 AM Aristeu Gil Alves Jr <aristeu.jr at gmail.com> wrote:

>> > Does s3 or swifta (for hadoop or spark) have integrated data-layout APIs
>> > for processing data locally, as the cephfs hadoop plugin has?
>> >
>> With s3 and swift you won't have data locality, as they were designed for
>> the public cloud.
>> We recommend disabling locality-based scheduling in Hadoop when running
>> with those connectors.
>> There is ongoing work to optimize those connectors to work with
>> object storage.
>> The Hadoop community is working on the s3a connector.
>> There is also https://github.com/SparkTC/stocator, which is a swift-based
>> connector IBM wrote for their cloud.
> Assuming these cases, how would a mapreduce process work without data
> locality?
> How do the processors get the data? There's still the need to split the data,
> no?
> Doesn't it severely impact performance with big files (not just the
> network)?
Given that you already have your data in CephFS (and have been using it
successfully for two years!), I'd try using its Hadoop plugin and seeing if
it suits your needs. Trying a less-supported plugin is a lot easier than
rolling out a new storage stack! :)
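
To make the locality question above concrete, here is a minimal sketch of the
difference between placing each input split on a node that holds its data and
placing splits round-robin. This is not Hadoop's scheduler or the CephFS
plugin's code; the split names, host names, and the assign_splits and
remote_reads helpers are all hypothetical, and load balancing is ignored.

```python
from itertools import cycle


def assign_splits(split_hosts, workers, use_locality=True):
    """Assign each input split to a worker.

    split_hosts maps a split name to the set of hosts holding its data
    (hypothetical layout info of the kind a filesystem plugin can report).
    Returns {split: (worker, is_local)}.
    """
    round_robin = cycle(workers)
    plan = {}
    for split, hosts in split_hosts.items():
        local = [w for w in workers if w in hosts]
        if use_locality and local:
            # Prefer a worker that already stores the split's data.
            worker = local[0]
        else:
            # No layout info (or locality disabled): take the next worker.
            worker = next(round_robin)
        plan[split] = (worker, worker in hosts)
    return plan


def remote_reads(plan):
    """Count splits that must be fetched over the network."""
    return sum(1 for _, is_local in plan.values() if not is_local)


# Hypothetical three-node cluster with one split stored on each node.
workers = ["node1", "node2", "node3"]
layout = {
    "part-0": {"node2"},
    "part-1": {"node3"},
    "part-2": {"node1"},
}

with_locality = assign_splits(layout, workers, use_locality=True)
without_locality = assign_splits(layout, workers, use_locality=False)
print(remote_reads(with_locality), remote_reads(without_locality))
```

In both cases the data is still split and every split is still processed; the
only difference is whether a task reads its split from local storage or pulls
it over the network.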