#5138: PANIC: remove page 0x82868278 from cache 0xd19e7680: page still has mappings!
---------------------------+------------------------------------------------
 Reporter:  mmadia         |       Owner:  bonefish
     Type:  bug            |      Status:  new
 Priority:  normal         |   Milestone:  R1
Component:  System/Kernel  |     Version:  R1/Development
 Keywords:                 |   Blockedby:
 Platform:  All            |    Blocking:
---------------------------+------------------------------------------------

Comment(by bonefish):

 Tracking this bug down will require a bit of work with kernel tracing.
 What is currently available might not even suffice, in which case we'll
 need to add some more, but let's first try with what we have (>= r34751).

 Please change the following in build/user_config_headers/tracing_config.h
 (copy it from build/config_headers/):
  - ENABLE_TRACING to 1.
  - MAX_TRACE_SIZE to (100 * 1024 * 1024) or more. If you have less than
    256 MB of RAM, try 10 or 20 MB instead and see whether that suffices.
  - SYSCALL_TRACING to 1.
  - TEAM_TRACING to 1.
  - VM_CACHE_TRACING to 2.
  - VM_CACHE_TRACING_STACK_TRACE to 10.
  - VM_PAGE_FAULT_TRACING to 1.

 When the panic occurs, there is quite a bit of interesting basic
 information to retrieve:
  - As always, the stack trace. If it is similar to the two attached ones,
    i.e. it goes through "vm_clone_area (nearest)" (which is actually
    delete_area()), then "call 16 -2" (16: the line number in the stack
    trace, 2: the number of arguments we want to see) will yield the
    function's parameters: the addresses of the address space and of the
    area (in case of gcc 4 they might be invalid). An
    "area address <area address>" should produce information on the area.
  - If the thread is a userland thread, the recent syscall history of the
    thread might be interesting: "traced 0 -10 -1 and thread <thread ID>
    or #syscall #team".
  - Info on the cache: "cache <cache address>" (cache address as in the
    panic message).
  - Info on the page: "page <page address>" (page address as in the panic
    message).
 Since the page has mappings, there should be at least one entry under
 "area mappings:". Use "area address <area address>" with the address
 listed there to get information on the area. A cache will be listed; use
 "cache <cache address>" and "cache_tree <cache address>" on it.

 That was the straightforward part. The rest is working with the tracing
 entries, which depends mainly on the information retrieved so far and on
 further information from the tracing. I'd best give a bit of background
 first:

 An area has a (VM) cache which is the top of a stack of caches (actually
 part of a tree, but let's keep it simple). A cache contains pages. A page
 represents a page of physical memory and can live in at most one cache.
 An area can map any pages of the caches in its cache stack -- those and
 only those.

 The panic this ticket is about is triggered when a no-longer-used cache
 is destroyed although it still contains a page that is mapped by some
 area. Normally that cannot happen, since an area that is still in use
 holds references to its cache (and indirectly to all caches in the
 stack), so the cache would not be destroyed. There are two kinds of bugs
 that could trigger the panic: either the cache reference counting is
 broken and the cache, although still in use, is destroyed, or the page in
 question has been erroneously moved to another cache, respectively the
 complete cache has been moved. Since the ref counting is relatively
 simple, I'd consider the latter far more likely.

 With the information retrieved above that can easily be verified: in the
 expected case, the output of "cache_tree" for the cache of the area the
 page is still mapped by will not contain the cache that is currently
 being destroyed. IOW the area has mapped a page that is no longer in one
 of its caches. It is relatively safe to assume that at the time the page
 was mapped, it was still in the right cache (i.e. one referenced by the
 area).
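 To make the background above concrete, here is a minimal toy model of the
 relationships in Python. This is purely illustrative and not Haiku's
 actual C implementation: all class and method names (Page, Cache, Area,
 insert_page, map_page, ...) are invented for the sketch; only the
 invariants it enforces come from the description above.

{{{
# Toy model of the VM relationships described above. Illustrative only:
# the names are invented; only the invariants mirror the kernel's rules.

class Page:
    """A page of physical memory; lives in at most one cache."""
    def __init__(self):
        self.cache = None
        self.mappings = []          # areas currently mapping this page

class Cache:
    """A VM cache: holds pages, is refcounted, sits in a stack of caches."""
    def __init__(self, source=None):
        self.pages = set()
        self.source = source        # next cache down the stack
        self.ref_count = 0

    def insert_page(self, page):
        assert page.cache is None   # a page lives in at most one cache
        page.cache = self
        self.pages.add(page)

    def remove_page(self, page):
        # The panic in this ticket corresponds to this check firing:
        # a still-mapped page is found while the cache is torn down.
        if page.mappings:
            raise AssertionError(
                "PANIC: remove page from cache: page still has mappings!")
        self.pages.remove(page)
        page.cache = None

class Area:
    """An area maps pages of the caches in its cache stack -- those only."""
    def __init__(self, top_cache):
        self.cache = top_cache
        c = top_cache
        while c:                    # reference every cache in the stack
            c.ref_count += 1
            c = c.source

    def cache_stack(self):
        c, stack = self.cache, []
        while c:
            stack.append(c)
            c = c.source
        return stack

    def map_page(self, page):
        # "those and only those": the page must be in the area's stack
        assert page.cache in self.cache_stack()
        page.mappings.append(self)
}}}

 In this model, destroying a cache that still holds a mapped page raises
 exactly the complaint in the panic message, while correct refcounting
 (Area holding a reference per stack cache) is what normally prevents the
 cache from being destroyed in the first place.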
 The first step toward understanding the bug is to find out at which point
 exactly the page left the "legal reach" of the area, i.e. when it was
 removed from a cache in the area's stack and inserted into an unrelated
 one, respectively when the cache containing the page was removed from the
 area's cache stack. In either case the approach is the same: first find
 the point at which the page was first inserted into a cache of the area,
 then trace the page and the caches it lives in until hitting the main
 point of interest. Both can be a bit tedious, since there can be quite a
 bit of fluctuation wrt. the caches of an area (fork()s and mmap()s can
 add new caches, exec*() and area deletion can cause removal of caches).

 If the area in question (still talking about the one with the page
 mapping) is a userland area, its point of creation should be easy to
 find, since there are only two syscalls that create areas:

 {{{
 traced 0 -1 -1 and and #syscall #<area ID> or #create_area #map_file
 }}}

 <area ID> must be the area ID in hexadecimal notation with leading "0x".
 Note the tracing entry index and search forward from that point looking
 for the page:

 {{{
 traced <index> 10 -1 and #page #<page address>
 }}}

 We're looking for entries saying "vm cache insert page:...". Start with
 the first such entry: it needs to be verified that the printed cache
 belongs to the area at this point:

 {{{
 cache_stack area <area address> <index2>
 }}}

 If the cache isn't part of the printed list, continue with the next
 entry until finding a cache that is (always use the index of the tracing
 entry being examined). After having found the point where the page was
 inserted into one of the area's caches, both the cache and the page need
 to be tracked forward:

 {{{
 traced <index2> 10 -1 and "#vm cache" #<cache address>
 }}}

 This will turn up tracing entries that involve the cache. The
 interesting ones are "vm cache remove page:..." for our page and
 "vm cache remove consumer:..." with another cache being the consumer.
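 The manual search described above can be summarized as a small Python
 sketch. The trace-entry format and both helper names
 (find_insertion_index, cache_stack_at) are invented for illustration; in
 KDL this search is done by hand with "traced" and "cache_stack".

{{{
# Sketch of the search above, over a simplified in-memory trace.
# Invented format: each entry is a dict with "what", "page", "cache".
# cache_stack_at(i) stands in for "cache_stack area <area address> <i>".

def find_insertion_index(entries, page, cache_stack_at, start=0):
    """Find the first "vm cache insert page" entry for `page` whose
    cache belongs to the area's cache stack at that entry's index.
    Returns (index, cache) or None."""
    for i in range(start, len(entries)):
        e = entries[i]
        if e["what"] == "vm cache insert page" and e["page"] == page:
            # corresponds to: cache_stack area <area address> <index2>
            if e["cache"] in cache_stack_at(i):
                return i, e["cache"]
            # cache not in the area's stack: skip to the next entry
    return None

# Example: the first insertion goes to an unrelated cache and is skipped;
# the second one lands in a cache of the area's stack.
entries = [
    {"what": "vm cache insert page",
     "page": "0x82868278", "cache": "0xcafe0000"},
    {"what": "vm cache insert page",
     "page": "0x82868278", "cache": "0xd19e7680"},
]
hit = find_insertion_index(entries, "0x82868278",
                           lambda i: ["0xd19e7680"])
}}}

 From the returned index onwards, the cache and page are then followed
 forward (the "traced <index2> 10 -1 ..." step above) until the page or
 its cache leaves the area's stack.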
 If the page is removed from the cache, find out where it is inserted
 next:

 {{{
 traced <index3> 10 -1 and "#vm cache" #<page address>
 }}}

 It must be verified that the new cache is in our area's stack at that
 point. If a consumer is removed from the cache, it must be checked
 whether the removed consumer is in the area's cache stack at that point
 (if not, the entry can be ignored) and, if so, whether shortly afterwards
 another cache of the area's stack is added as a consumer again. Finding
 out whether a cache is in the area's cache stack at a given point works
 just as above, i.e. run

 {{{
 cache_stack area <area address> <index4>
 }}}

 and check whether the cache's address is printed. If the cache is not in
 the area's cache stack, we have found the point where things go wrong.
 Otherwise continue following the cache containing the page.

 At the point of interest (i.e. the tracing entry where the page or its
 containing cache was removed from the area's cache stack) -- let's call
 it <index5> -- a bit of context would be interesting:

 {{{
 traced --stacktrace <index5> -20
 traced --stacktrace <index5> 20
 }}}

 (will probably be longish)

 Please attach the complete KDL session.

 A general hint for capturing KDL sessions: make sure the serial output is
 processed by a terminal, since otherwise edited prompt lines will be hard
 to read. I.e. if you direct the output directly into a file, "cat" the
 file in a terminal when you're done and copy the lines from the terminal
 (the scrollback history must be big enough, of course).

--
Ticket URL: <http://dev.haiku-os.org/ticket/5138#comment:5>
Haiku <http://dev.haiku-os.org>
Haiku - the operating system.