Darryl Gove's blog
mtmalloc vs umem
A little while back I was looking at the performance of the STL with multithreaded code. I got the opportunity to try this on a particular code, and rather shockingly to me performance was absolutely terrible! I'd linked the code with mtmalloc, and the hottest function in the profile was malloc_internal. I've put together a fake code, and here's the profile from that:
Excl. Incl. Name User CPU User CPU sec. sec. 266.446 266.446258.301 263.084 malloc_internal 1.661 1.951 free 1.401 1.401 mutex_lock_impl 0.961 0.961 mutex_unlock
We can dig into the disassembly of malloc_internal to find out what's going on:
73.201 73.201 [?] 1724: cmp %o5, 0
1.981 1.981 [?] 1728: bne 0x1740
0.320 0.320 [?] 172c: nop
0.490 0.490 [?] 1730: ld [%i2 + 44], %i2
1.191 1.191 [?] 1734: cmp %i2, 0
0.901 0.901 [?] 1738: bne,a 0x1724
## 176.443 176.443 [?] 173c: ld [%i2 + 32], %o5
It's not hard to visualise what the original C code would look like:
while ((ptr->value==0) && (ptr->next!=0)) { ptr=ptr->next; }
Fortunately the source code is searchable and the above loop looks sufficiently similar to line 1032 of mtmalloc.c:
1032 while (thiscache != NULL && thiscache->mt_nfree == 0) 1033 thiscache = thiscache->mt_next;
So what's going on?
Reading through the source of malloc_internal, it appears that mtmalloc builds up a linked list of chunks of memory for each size of memory request. The size of the chunks of memory is 8KB*requestsize, and requestsize is 9. So basically each chunk of memory is 72KB in size. So when a memory request comes in, malloc_internal looks at the current chunk, and if memory can be allocated from there, then it returns memory from that chunk. If not it goes to the next chunk and so on. This works very well when memory is allocated at once, but as memory gets freed, these chunks of memory become like Swiss-cheese, with lots of holes in them. If a lot of memory of a particular size is requested, then freed, there can be a large number of these chunks in the linked list, and scanning through the chunks to find one with free space can be time consuming. And that is the condition that my test code exercises.
It's probably worth revealing the test code, at this point, so that you can see what it does:
#include <stdlib.h>
typedef struct s
{
struct s * next;
char padding[508];
} S;
void main()
{
struct s * head;
struct s * keep;
struct s * current;
head=0;
keep=0;
for (int j=0; j<100; j++)
{
for (int i=0; i<100000; i++)
{
current=(struct s*)malloc(sizeof(struct s));
if (i&1)
{
current->next=head;
head=current;
}
else
{
current->next=keep;
keep=current;
}
}
current = head;
while (current!=0)
{
struct s * tmp = current;
current = current -> next;
free(current);
}
head = 0;
}
}
The code maintains two lists, one that it places memory onto for a long duration, and another list that holds memory for only a short duration. The memory footprint of the code keeps increasing, so more chunks are added to the lists, and holding on to the memory for a long period of time ensures that the chunks end up with lots of gaps in them. The runtime of this code is as follows:
% cc -O mtm.c -lmtmalloc % timex a.out real 4:44.18 user 4:33.80 sys 8.70
However there is an API to libmtmalloc that allows us to adjust the size of the chunks. The following changes increase the requestsize from 9 to 20:
#include... mallocctl(MTCHUNKSIZE,20); ...
The performance reduces from nearly 5 minutes to about 1 minute:
% cc -O mtm.c -lmtmalloc % timex a.out real 1:09.10 user 1:01.09 sys 6.53
If we increase the requestsize to 30, performance improves still further:
% cc -O mtm.c -lmtmalloc % timex a.out real 38.36 user 31.41 sys 4.96
Of course, libmtmalloc is not the only memory allocator that is optimised for multi-threaded allocation. We also have libumem, compiling the original code to use this results in the following performance:
% cc -O mtm.c -lumem % timex a.out real 31.06 user 18.10 sys 10.95
So this is probably a good indication that you will get better performance from libumem if your application allocates and deallocates lots of memory. If you are using libmtmalloc in this role, then you may need to tune the requestsize to a greater number than the default - although this will increase the memory footprint of your application.
Posted at 11:15AM Jun 26, 2009 by Darryl Gove in Sun |


