#5816: Kernel out of memory ---------------------------+------------------------------------------------ Reporter: axeld | Owner: bonefish Type: bug | Status: assigned Priority: normal | Milestone: R1/alpha3 Component: System/Kernel | Version: R1/Development Keywords: | Blockedby: Patch: 0 | Platform: All Blocking: 4963 | ---------------------------+------------------------------------------------ Comment(by bonefish): Just finished running the test again with some more debug output. The low resource state transitions are logged and block_cache::_LowMemoryHandler() prints how the RemoveUnusedBlocks() changed the number of unused blocks. The test started again at about 2-3 MB/s and -- I hadn't noticed this before -- after a few seconds reaches even > 10 MB/s for a short time. Afterwards it varies from 1 to 10 MB/s until the address space "note" low resource state is reached. Then the throughput drops immediately to < 1 MB, varying from 10 KB/s to a few 100 KB/s, only from time to time recovering to 1 or 2 MB/s (sometimes even more) for a few seconds. Here's the output from around the point where the low resource limit is hit: {{{ heap_add_area: area 3649 added to small heap 0xd090c000 - usable range 0xc2c07000 - 0xc3000000 slab memory manager: created area 0xc5800000 (3650) slab memory manager: created area 0xc6000000 (3651) low resource address space: normal -> note block_cache::_LowMemoryHandler(): 0xcd92a198: unused: 25997 -> 25997 block_cache::_LowMemoryHandler(): 0xcd92a7f8: unused: 3 -> 0 block_cache::_LowMemoryHandler(): 0xccd27b28: unused: 313434 -> 313384 block_cache::_LowMemoryHandler(): 0xcfd1f660: unused: 13 -> 0 block_cache::_LowMemoryHandler(): 0xcd92a198: unused: 2598 -> 2598 block_cache::_LowMemoryHandler(): 0xcd92a7f8: unused: 0 -> 0 block_cache::_LowMemoryHandler(): 0xccd27b28: unused: 311848 -> 311798 block_cache::_LowMemoryHandler(): 0xcfd1f660: unused: 0 -> 0 block_cache::_LowMemoryHandler(): 0xcd92a198: unused: 22 -> 22 block_cache::_LowMemoryHandler(): 0xcd92a7f8: unused: 0 -> 0 block_cache::_LowMemoryHandler(): 0xccd27b28: unused: 311378 -> 311328 block_cache::_LowMemoryHandler(): 0xcfd1f660: unused: 0 -> 0 block_cache::_LowMemoryHandler(): 0xcd92a198: unused: 0 -> 0 block_cache::_LowMemoryHandler(): 0xcd92a7f8: unused: 0 -> 0 block_cache::_LowMemoryHandler(): 0xccd27b28: unused: 311244 -> 311194 block_cache::_LowMemoryHandler(): 0xcfd1f660: unused: 0 -> 0 }}} The first block cache (0xcd92a198) is the one of the volume I'm copying, the third one (0xccd27b28) from the target volume (both on the same disk). As can be seen Tracker's pre-flight phase has let the number of unused blocks of the source volume grow to > 300000. The 25997 blocks of the target volume are probably mostly the ones that have already been written. After only three seconds the target cache has been drained completely. RemoveUnusedBlocks() didn't remove a single block, though, since at level "note" only blocks older than two minutes are removed. The reason that all blocks are removed nonetheless is block_cache::Allocate(), which removes 2 unused blocks each time it is called. And since the target volume is quite active -- lots of writing and transactions going on -- that happens quite often. This would also explain the abysmal performance from that point on, since all blocks that become unused even for a short time are thrown away and have to be re-read from disk when they are needed again. The low resource behavior definitely needs to be rethought. The source volume cache lost at max a few hundred blocks per second. It took about 1:45 h to drop to 0. Interestingly the whole time not a single slab are was released -- which must be a bug/leak somewhere -- and the heap kept growing -- likely for vnodes, file caches and BFS data structures, which don't care about the address space state. After about 2:20 h (80 % copied) the heap grower failed to allocate a heap area. {{{ kdebug> low_resource current state: ---a pages: normal (723271) memory: normal (3025899520) sems: normal (64352) aspace: note (28450816) function data resources prio function-name 0x800df218 0x00000000 p m a 5 object_cache_low_memory__FPvUll 0x800bad8c 0x00000000 p m 0 vnode_low_resource_handler__FPvUll 0x8003d044 0xcd92a198 p m a 0 _LowMemoryHandler__11block_cachePvUll 0x8003d044 0xcd92a7f8 p m a 0 _LowMemoryHandler__11block_cachePvUll 0x8003d044 0xccd27b28 p m a 0 _LowMemoryHandler__11block_cachePvUll 0x8003d044 0xcfd1f660 p m a 0 _LowMemoryHandler__11block_cachePvUll }}} Since there are still 28 MB address space left (probably fragmented or partially reserved), the address space low resource state is still "note". "Warning" state is currently reached at 8 MB, "critical" at utterly useless 32 KB. This is obviously something that needs to be changed. As for the block caches: {{{ kdebug> block_cache 0xcd92a198 BLOCK CACHE: 0xcd92a198 fd: 0 max_blocks: 26214048 block_size: 2048 next_transaction_id: 198 buffer_cache: 0x828a1dc0 busy_reading: 0, no waiters busy_writing: 0, no waiters 797 blocks total, 311 dirty, 0 discarded, 0 referenced, 0 busy, 486 in unused. kdebug> block_cache 0xccd27b28 BLOCK CACHE: 0xccd27b28 fd: 7 max_blocks: 26214048 block_size: 2048 next_transaction_id: 4 buffer_cache: 0x828a6a00 busy_reading: 0, no waiters busy_writing: 0, no waiters 4 blocks total, 3 dirty, 0 discarded, 0 referenced, 0 busy, 1 in unused. }}} They don't know a lot of blocks. The slabs: {{{ kdebug> slabs address name objsize usage empty usedobj total flags [...] 0x82808d00 cached blocks 64 12124160 178 814 189255 20000000 0x828a1dc0 block cache buffers 2048 122683392 192 1622 59904 20000000 0x828a5000 block cache buffers 4096 524288 0 2 128 20000000 0x828a6a00 block cache buffers 2048 642777088 1225 11 313856 20000000 0x828a6280 block cache buffers 2048 524288 1 0 256 20000000 kdebug> slab_cache 0x82808d00 name: cached blocks lock: 0x82808d28 object_size: 64 cache_color_cycle: 0 total_objects: 189255 used_count: 814 empty_count: 178 pressure: 354 slab_size: 65536 usage: 12124160 maximum: 0 flags: 0x20000000 cookie: 0x00000000 resize entry don't wait: 0x00000000 resize entry can wait: 0x00000000 slab chunk size used offset free empty: 0xe902ffe4 0xe9020000 1023 1023 0 0xe90217fc 0xe804ffe4 0xe8040000 1023 1023 0 0xe8044afc 0x82efffe4 0x82ef0000 1023 1023 0 0x82efbebc [...] partial: 0xda05ffe4 0xda050000 1023 491 0 0xda057cbc 0xe085ffe4 0xe0850000 1023 1011 0 0xe085733c 0x8291ffe4 0x82910000 1023 1010 0 0x8291927c 0xdf04ffe4 0xdf040000 1023 1018 0 0xdf047ebc 0xdd81ffe4 0xdd810000 1023 926 0 0xdd8151bc 0xd905ffe4 0xd9050000 1023 944 0 0xd905eafc 0xe105ffe4 0xe1050000 1023 947 0 0xe105773c depot: full: 0x00000000, count 16 empty: 0x00000000, count 0 max full: 16 capacity: 32 stores: [0] loaded: 0x00000000 previous: 0x00000000 [1] loaded: 0x00000000 previous: 0x00000000 [2] loaded: 0x828ce140 previous: 0x00000000 [3] loaded: 0x00000000 previous: 0x00000000 [4] loaded: 0x828ce780 previous: 0x00000000 [5] loaded: 0x828ceaa0 previous: 0x00000000 [6] loaded: 0x00000000 previous: 0x00000000 [7] loaded: 0x828ce8c0 previous: 0x00000000 kdebug> slab_cache 0x828a6a00 name: block cache buffers lock: 0x828a6a28 object_size: 2048 cache_color_cycle: 0 total_objects: 313856 used_count: 11 empty_count: 1225 pressure: 2449 slab_size: 524288 usage: 642777088 maximum: 0 flags: 0x20000000 cookie: 0x00000000 resize entry don't wait: 0x00000000 resize entry can wait: 0x00000000 slab chunk size used offset free empty: 0x828af300 0xde200000 256 256 0 0xde27e7fc 0x828af240 0xde500000 256 256 0 0xde559ffc 0x828af1c0 0xde700000 256 256 0 0xde730ffc [...] partial: 0x828059e0 0x82d00000 256 245 0 0x82d16ffc depot: full: 0x00000000, count 4 empty: 0x00000000, count 0 max full: 4 capacity: 8 stores: [0] loaded: 0x00000000 previous: 0x00000000 [1] loaded: 0x00000000 previous: 0x00000000 [2] loaded: 0x00000000 previous: 0x00000000 [3] loaded: 0x00000000 previous: 0x00000000 [4] loaded: 0x00000000 previous: 0x00000000 [5] loaded: 0x00000000 previous: 0x00000000 [6] loaded: 0x00000000 previous: 0x00000000 [7] loaded: 0x00000000 previous: 0x00000000 }}} Truncated the free slabs lists -- there are loads of them. Apparently something goes wrong in the slab allocator's low resource handler. Will look into this tomorrow. -- Ticket URL: <http://dev.haiku-os.org/ticket/5816#comment:5> Haiku <http://dev.haiku-os.org> Haiku - the operating system.