[ceph-users] Cephfs Hadoop Plugin and CEPH integration
owasserm at redhat.com
Wed Nov 29 09:42:57 PST 2017
On Wed, Nov 29, 2017 at 6:52 PM, Aristeu Gil Alves Jr
<aristeu.jr at gmail.com> wrote:
>> > Do s3 or swift (for Hadoop or Spark) have integrated data-layout APIs for
>> > local data processing, as the cephfs hadoop plugin has?
>> With s3 and swift you won't have data locality, as they were designed for
>> public cloud.
>> We recommend disabling locality-based scheduling in Hadoop when running
>> with those connectors.
>> There is ongoing work to optimize those connectors to work with
>> object storage.
>> The Hadoop community is working on the s3a connector.
>> There is also https://github.com/SparkTC/stocator, which is a swift-based
>> connector IBM wrote for their cloud.
> Assuming these cases, how would a mapreduce process work without data locality?
> How do the processors get the data? There's still the need to split the data,
The s3/swift storage splits the data.
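As a rough illustration of what this means for the tasks: an object can be
treated as a sequence of fixed-size byte ranges, and each task fetches its own
range over the network no matter which node it runs on. The sketch below is
only that, a sketch. The function names and the 128 MiB split size are
assumptions made for the example, not code from the s3a or swift connectors.

```python
# Illustrative only: split_ranges, range_header and the 128 MiB default are
# assumptions for this example, not taken from the s3a or swift connectors.

def split_ranges(object_size, split_size=128 * 1024 * 1024):
    """Return inclusive (start, end) byte ranges that cover the object."""
    if object_size < 0 or split_size <= 0:
        raise ValueError("object_size must be >= 0 and split_size > 0")
    ranges = []
    start = 0
    while start < object_size:
        # The last range may be shorter than split_size.
        end = min(start + split_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_header(start, end):
    """Format an HTTP Range header value for one split (inclusive bounds)."""
    return "bytes={}-{}".format(start, end)


if __name__ == "__main__":
    # Print the Range header each task would send for a 300 MiB object.
    for start, end in split_ranges(300 * 1024 * 1024):
        print(range_header(start, end))
```

Since every range is read remotely, it does not matter to the scheduler where
a task is placed, which is why locality-based scheduling brings nothing here.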
> Doesn't it severely impact the performance of big files (not just the
There is a Facebook research paper showing that locality is not as beneficial as
expected; if I remember correctly, it was around 30%.
Users who run Hadoop on s3/swift are either already using object
storage (for other purposes) or have a very large data set that fits
object storage better.