[ceph-users] RADOS async client memory usage explodes when reading several objects in sequence

Daniel Goldbach dan.goldbach at gmail.com
Fri Sep 14 03:46:03 PDT 2018


One of my colleagues believes he's tracked down the source of the missing
deallocations in the librados code. I'll let him reply himself with his
findings. For now, we've found a workaround: when an object is reread, the
memory allocated for it seems to be freed and a new block is allocated for
the size of the new read. So you can effectively "free" the memory
allocated for an object by rereading one byte from it.

In my example, if after each read operation you execute a new read
operation that simply reads one byte from the same object, the memory usage
stabilises at a handful of MB, as expected.
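As a minimal sketch of that workaround (assuming a connected cluster and io context; the helper names here are ours, not part of the librados API, and error handling is trimmed):

```c
#include <assert.h>
#include <stddef.h>
#include <rados/librados.h>

/* Read [0, len) of objname into buf with the sync read-op API,
 * then release the op. */
static void read_once(rados_ioctx_t io, const char *objname,
                      char *buf, size_t len)
{
    size_t bytes_read;
    int retval;

    rados_read_op_t op = rados_create_read_op();
    rados_read_op_read(op, 0, len, buf, &bytes_read, &retval);
    retval = rados_read_op_operate(op, io, objname, 0);
    assert(retval == 0);
    rados_release_read_op(op);
}

/* The workaround: after the real read, re-read one byte from the same
 * object.  The re-read appears to make librados drop the buffer it was
 * still holding for the previous, larger read. */
void read_with_workaround(rados_ioctx_t io, const char *objname,
                          char *buf, size_t len)
{
    char onebyte;

    read_once(io, objname, buf, len);    /* the read we actually want */
    read_once(io, objname, &onebyte, 1); /* 1-byte re-read "frees" it */
}
```

With this in place of a plain read, our memory usage stays flat instead of growing with the total bytes read.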

On Thu, Sep 13, 2018 at 6:47 PM Daniel Goldbach <dan.goldbach at gmail.com>
wrote:

> Yes I understand that. If you look at the example, the data buffer is
> stack allocated and hence its memory is freed when the stack frame for
> readobj is destroyed. Additionally, no leak occurs if I comment out the
> rados_read_op_operate line. This is a problem with librados, not with my
> example.
>
>
>
> On Thu, Sep 13, 2018 at 6:22 PM Gregory Farnum <gfarnum at redhat.com> wrote:
>
>> On Thu, Sep 13, 2018 at 6:35 AM Daniel Goldbach <dan.goldbach at gmail.com>
>> wrote:
>>
>>> I'm sure I'm not forgetting to free any buffers. I'm not even allocating
>>> any heap memory in the example above.
>>>
>>> On further investigation, the same issue *does* happen with the
>>> synchronous read operation API. I erroneously said earlier that it
>>> doesn't; what I meant was that the issue doesn't happen when calling
>>> the rados_read() function directly to read from an object.
>>>
>>> Using the synchronous read operation API, the following four commands
>>> are sufficient to trigger the 'leak':
>>>
>>>     rados_read_op_t read_op = rados_create_read_op();
>>>     rados_read_op_read(read_op, 0, SIZE, data, &bytes_read, &retval);
>>>     retval = rados_read_op_operate(read_op, *io, objname, 0);
>>>     assert(retval == 0);
>>>     rados_release_read_op(read_op);
>>>
>>> Curiously, the presence of the final line
>>> "rados_release_read_op(read_op);" does not seem to affect the memory
>>> usage of the program at all! It can be removed and the program's
>>> memory profile looks the same. It's as if the release mechanism is
>>> totally broken.
>>>
>>> I agree that the source is most likely mismanagement of C_bl_to_buf.
>>> rados_release_read_op should call Objecter.h:~ObjectOperation() on the
>>> read op, which in turn should delete the C_bl_to_buf whose pointer was
>>> pushed onto the ObjectOperation's out_handler vector in
>>> ObjectOperation::read(..). Something around that bit of code must be
>>> broken.
>>>
>>
>> Okay, but you get that *you* are responsible for cleaning up the "data"
>> buffer, right? librados isn't going to delete that for you automatically as
>> it has no way of knowing when it should do so.
>>
>>
>>>
>>>
>>>
>>> On Wed, Sep 12, 2018 at 5:35 PM Gregory Farnum <gfarnum at redhat.com>
>>> wrote:
>>>
>>>> That code bit is just "we have an incoming message with data", which is
>>>> what we'd expect, but means it's not very helpful for tracking down the
>>>> source of any leaks.
>>>>
>>>> My guess is still very much that somehow there are deallocations
>>>> missing here. Internally, the synchronous API wraps the async one, so
>>>> it'd be subject to the same bugs. (It could in principle be
>>>> pathological malloc behavior, but if valgrind reports the same memory
>>>> usage as the observed RSS, then I don't think malloc behavior can
>>>> explain it.)
>>>>
>>>> Are you actually freeing the buffers you provide when you're done with
>>>> them? I guess there could be something with the "C_bl_to_buf" structure
>>>> getting managed wrong as well since that looks to be unique to this code
>>>> path, but it wouldn't depend on the size of the objects since it's just 4
>>>> pointers/ints.
>>>> -Greg
>>>>
>>>> On Wed, Sep 12, 2018 at 8:43 AM Daniel Goldbach <dan.goldbach at gmail.com>
>>>> wrote:
>>>>
>>>>> The issue continues even when I do rados_aio_release(completion) at
>>>>> the end of the readobj(..) definition in the example. Also, in our
>>>>> production code we call rados_aio_release for each completion and we still
>>>>> see the issue there. The release command doesn't guarantee instant release,
>>>>> so could it be that the release operations are getting queued up but never
>>>>> executed?
>>>>>
>>>>> Valgrind massif shows that the relevant allocations are all happening
>>>>> in the bit of code in the following stack trace:
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------------
>>>>>   n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
>>>>> --------------------------------------------------------------------------------
>>>>>  62    166,854,775       82,129,696       81,808,615       321,081            0
>>>>>  63    168,025,321       83,155,872       82,834,072       321,800            0
>>>>> 99.61% (82,834,072B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
>>>>> ->93.75% (77,955,072B) 0x579AC05: ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int) (in /usr/lib/ceph/libceph-common.so.0)
>>>>> | ->93.75% (77,955,072B) 0x597BB48: AsyncConnection::process() (in /usr/lib/ceph/libceph-common.so.0)
>>>>> | | ->93.75% (77,955,072B) 0x598BC96: EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (in /usr/lib/ceph/libceph-common.so.0)
>>>>> | |   ->93.75% (77,955,072B) 0x5990816: ??? (in /usr/lib/ceph/libceph-common.so.0)
>>>>> | |     ->93.75% (77,955,072B) 0xE957C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
>>>>> | |       ->93.75% (77,955,072B) 0xE6896B8: start_thread (pthread_create.c:333)
>>>>> | |         ->93.75% (77,955,072B) 0x529741B: clone (clone.S:109)
>>>>> | |
>>>>> | ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
>>>>> |
>>>>> ->05.87% (4,879,000B) in 387 places, all below massif's threshold (1.00%)
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Sep 12, 2018 at 4:05 PM Gregory Farnum <gfarnum at redhat.com>
>>>>> wrote:
>>>>>
>>>>>> Yep, those completions are maintaining bufferlist references IIRC, so
>>>>>> they’re definitely holding the memory buffers in place!
>>>>>> On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley <cbodley at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
>>>>>>> > Hi all,
>>>>>>> >
>>>>>>> > We're reading from a Ceph Luminous pool using the librados
>>>>>>> > asynchronous I/O API. We're seeing some concerning memory usage
>>>>>>> > patterns when we read many objects in sequence.
>>>>>>> >
>>>>>>> > The expected behaviour is that our memory usage stabilises at a
>>>>>>> small
>>>>>>> > amount, since we're just fetching objects and ignoring their data.
>>>>>>> > What we instead find is that the memory usage of our program grows
>>>>>>> > linearly with the amount of data read for an interval of time, and
>>>>>>> > then continues to grow at a much slower but still consistent pace.
>>>>>>> > This memory is not freed until program termination. My guess is
>>>>>>> that
>>>>>>> > this is an issue with Ceph's memory allocator.
>>>>>>> >
>>>>>>> > To demonstrate, we create 20000 objects of size 10KB, and of size
>>>>>>> > 100KB, and of size 1MB:
>>>>>>> >
>>>>>>> >     #include <stdio.h>
>>>>>>> >     #include <stdlib.h>
>>>>>>> >     #include <string.h>
>>>>>>> >     #include <rados/librados.h>
>>>>>>> >
>>>>>>> >     int main() {
>>>>>>> >         rados_t cluster;
>>>>>>> >         rados_create(&cluster, "test");
>>>>>>> >         rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>>>>>>> >         rados_connect(cluster);
>>>>>>> >
>>>>>>> >         rados_ioctx_t io;
>>>>>>> >         rados_ioctx_create(cluster, "test", &io);
>>>>>>> >
>>>>>>> >         char data[1000000];
>>>>>>> >         memset(data, 'a', 1000000);
>>>>>>> >
>>>>>>> >         char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
>>>>>>> >         int i;
>>>>>>> >         for (i = 0; i < 20000; i++) {
>>>>>>> >             sprintf(smallobj_name, "10kobj_%d", i);
>>>>>>> >             rados_write(io, smallobj_name, data, 10000, 0);
>>>>>>> >
>>>>>>> >             sprintf(mediumobj_name, "100kobj_%d", i);
>>>>>>> >             rados_write(io, mediumobj_name, data, 100000, 0);
>>>>>>> >
>>>>>>> >             sprintf(largeobj_name, "1mobj_%d", i);
>>>>>>> >             rados_write(io, largeobj_name, data, 1000000, 0);
>>>>>>> >
>>>>>>> >             printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
>>>>>>> >                    smallobj_name, mediumobj_name, largeobj_name);
>>>>>>> >         }
>>>>>>> >
>>>>>>> >         return 0;
>>>>>>> >     }
>>>>>>> >
>>>>>>> >     $ gcc create.c -lrados -o create
>>>>>>> >     $ ./create
>>>>>>> >     wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of size 1000000
>>>>>>> >     wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of size 1000000
>>>>>>> >     [...]
>>>>>>> >     wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000, 1mobj_19998 of size 1000000
>>>>>>> >     wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000, 1mobj_19999 of size 1000000
>>>>>>> >
>>>>>>> > Now we read each of these objects with the async API, into the
>>>>>>> > same buffer. First we read just the 10KB objects:
>>>>>>> >
>>>>>>> >     #include <assert.h>
>>>>>>> >     #include <stdio.h>
>>>>>>> >     #include <stdlib.h>
>>>>>>> >     #include <string.h>
>>>>>>> >     #include <rados/librados.h>
>>>>>>> >
>>>>>>> >     void readobj(rados_ioctx_t* io, char objname[]);
>>>>>>> >
>>>>>>> >     int main() {
>>>>>>> >         rados_t cluster;
>>>>>>> >         rados_create(&cluster, "test");
>>>>>>> >         rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>>>>>>> >         rados_connect(cluster);
>>>>>>> >
>>>>>>> >         rados_ioctx_t io;
>>>>>>> >         rados_ioctx_create(cluster, "test", &io);
>>>>>>> >
>>>>>>> >         char smallobj_name[16];
>>>>>>> >         int i, total_bytes_read = 0;
>>>>>>> >
>>>>>>> >         for (i = 0; i < 20000; i++) {
>>>>>>> >             sprintf(smallobj_name, "10kobj_%d", i);
>>>>>>> >             readobj(&io, smallobj_name);
>>>>>>> >
>>>>>>> >             total_bytes_read += 10000;
>>>>>>> >             printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
>>>>>>> >         }
>>>>>>> >
>>>>>>> >         getchar();
>>>>>>> >         return 0;
>>>>>>> >     }
>>>>>>> >
>>>>>>> >     void readobj(rados_ioctx_t* io, char objname[]) {
>>>>>>> >         char data[1000000];
>>>>>>> >         unsigned long bytes_read;
>>>>>>> >         rados_completion_t completion;
>>>>>>> >         int retval;
>>>>>>> >
>>>>>>> >         rados_read_op_t read_op = rados_create_read_op();
>>>>>>> >         rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
>>>>>>> >         retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
>>>>>>> >         assert(retval == 0);
>>>>>>> >
>>>>>>> >         retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
>>>>>>> >         assert(retval == 0);
>>>>>>> >
>>>>>>> >         rados_aio_wait_for_complete(completion);
>>>>>>> >         rados_aio_get_return_value(completion);
>>>>>>> >     }
>>>>>>> >
>>>>>>> >     $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
>>>>>>> >     Read 10kobj_0 for total 10000
>>>>>>> >     Read 10kobj_1 for total 20000
>>>>>>> >     [...]
>>>>>>> >     Read 10kobj_19998 for total 199990000
>>>>>>> >     Read 10kobj_19999 for total 200000000
>>>>>>> >
>>>>>>> > We read 200MB. A graph of the resident set size of the program is
>>>>>>> > attached as mem-graph-10k.png, with seconds on x axis and KB on
>>>>>>> the y
>>>>>>> > axis. You can see that the memory usage increases throughout,
>>>>>>> which
>>>>>>> > itself is unexpected since that memory should be freed over time
>>>>>>> and
>>>>>>> > we should only hold 10KB of object data in memory at a time. The
>>>>>>> rate
>>>>>>> > of growth decreases and eventually stabilises, and by the end
>>>>>>> we've
>>>>>>> > used 60MB of RAM.
>>>>>>> >
>>>>>>> > We repeat this experiment for the 100KB and 1MB objects and find
>>>>>>> that
>>>>>>> > after all reads they use 140MB and 500MB of RAM, and memory usage
>>>>>>> > presumably would continue to grow if there were more objects. This
>>>>>>> is
>>>>>>> > orders of magnitude more memory than what I would expect these
>>>>>>> > programs to use.
>>>>>>> >
>>>>>>> >   * We do not get this behaviour with the synchronous API, and the
>>>>>>> >     memory usage remains stable at just a few MB.
>>>>>>> >   * We've found that for some reason, this doesn't happen (or
>>>>>>> doesn't
>>>>>>> >     happen as severely) if we intersperse large reads with much
>>>>>>> >     smaller reads. In this case, the memory usage seems to
>>>>>>> stabilise
>>>>>>> >     at a reasonable number.
>>>>>>> >   * Valgrind only reports a trivial amount of unreachable memory.
>>>>>>> >   * Memory usage doesn't increase in this manner if we repeatedly
>>>>>>> read
>>>>>>> >     the same object over and over again. It hovers around 20MB.
>>>>>>> >   * In other experiments we've done, with different object data and
>>>>>>> >     distributions of object sizes, we've seen memory usage grow
>>>>>>> even
>>>>>>> >     larger in proportion to the amount of data read.
>>>>>>> >
>>>>>>> > We maintain long-running (order of weeks) services that read
>>>>>>> > objects from Ceph and send them elsewhere. Over time, the memory
>>>>>>> > usage of some of these services has grown to more than 6GB, which
>>>>>>> > is unreasonable.
>>>>>>> >
>>>>>>> > --
>>>>>>> > Regards,
>>>>>>> > Dan G
>>>>>>> >
>>>>>>> >
>>>>>>> > _______________________________________________
>>>>>>> > ceph-users mailing list
>>>>>>> > ceph-users at lists.ceph.com
>>>>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>>> It looks like the async example is missing calls to
>>>>>>> rados_aio_release() to clean up the completions. I'm not sure that
>>>>>>> would account for all of the memory growth, but that's where I
>>>>>>> would start. Past that, running the client under valgrind massif
>>>>>>> should help with further investigation.
>>>>>>>
>>>>>>> Casey
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Dan Goldbach
>>>>>
>>>>
>>>
>>> --
>>> Regards,
>>> Dan Goldbach
>>>
>>
>
> --
> Regards,
> Dan Goldbach
>


-- 
Regards,
Dan Goldbach

