[ceph-users] Cephfs Hadoop Plugin and CEPH integration

Orit Wasserman owasserm at redhat.com
Wed Nov 29 09:42:57 PST 2017

On Wed, Nov 29, 2017 at 6:52 PM, Aristeu Gil Alves Jr
<aristeu.jr at gmail.com> wrote:
>> > Do s3 or swifta (for Hadoop or Spark) have integrated data-layout APIs
>> > for processing data locally, as the cephfs hadoop plugin has?
>> >
>> With s3 and swift you won't have data locality, as they were designed for
>> public cloud.
>> We recommend disabling locality-based scheduling in Hadoop when running
>> with those connectors.
>> There is ongoing work to optimize those connectors to work with
>> object storage.
>> The Hadoop community is working on the s3a connector.
>> There is also https://github.com/SparkTC/stocator, which is a swift-based
>> connector IBM wrote for their cloud.
> In that case, how would a mapreduce job work without data locality?
> How do the processors get the data? There is still a need to split the
> data, no?
The s3/swift storage splits the data.
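
As an illustration only (this is not spelled out in the thread): a common
way for a job to read one large object in parallel is to give each task a
byte range and have it fetch that range with an HTTP Range request. The
sketch below assumes that model. The function names and the sizes in the
usage note are made up for the example; real connectors derive their split
sizes from their own configuration.

```python
def byte_range_splits(object_size, split_size):
    """Return inclusive (start, end) byte ranges that cover the object."""
    if object_size < 0 or split_size <= 0:
        raise ValueError("object_size must be >= 0 and split_size > 0")
    return [
        (start, min(start + split_size, object_size) - 1)
        for start in range(0, object_size, split_size)
    ]


def range_header(start, end):
    """Format an HTTP Range header value for an inclusive byte range."""
    return f"bytes={start}-{end}"
```

For a 100-byte object and a 32-byte split size this yields four ranges,
the last one shorter than the rest. Each task pulls its own range over the
network, so no task depends on where the bytes physically live.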

> Doesn't it severely impact the performance of big files (not just the
> network)?
There is a Facebook research paper showing that locality is not as
beneficial as expected; if I remember correctly, it was around 30%.
Users who run Hadoop on s3/swift are either already using object
storage (for other purposes) or have a very large data set that fits
object storage better.
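
For the recommendation quoted above about disabling locality-based
scheduling, here is a rough sketch of what that could look like on the
Spark side. It is an assumption-laden example, not a tested setup: the
endpoint URL is hypothetical, the helper name is made up, and the thread
itself talks about Hadoop scheduling in general rather than these specific
Spark settings. `spark.locality.wait` is Spark's wait time before it gives
up on a data-local slot, and the `fs.s3a.*` keys are s3a connector options.

```python
def spark_submit_conf(endpoint):
    """Build spark-submit --conf arguments for an s3a-backed job."""
    conf = {
        # Do not wait for a data-local slot; object storage has no locality.
        "spark.locality.wait": "0s",
        # Point the s3a connector at the object gateway (hypothetical URL).
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # Use path-style bucket addressing instead of virtual-host style.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

Calling `spark_submit_conf("http://rgw.example.com:7480")` returns a flat
argument list that can be appended to a `spark-submit` command line.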

> --
> Aristeu
