[haiku-bugs] Re: [Haiku] #5138: PANIC: remove page 0x82868278 from cache 0xd19e7680: page still has mappings!

#5138: PANIC: remove page 0x82868278 from cache 0xd19e7680: page still has
mappings!
---------------------------+------------------------------------------------
 Reporter:  mmadia         |       Owner:  bonefish      
     Type:  bug            |      Status:  new           
 Priority:  normal         |   Milestone:  R1            
Component:  System/Kernel  |     Version:  R1/Development
 Keywords:                 |   Blockedby:                
 Platform:  All            |    Blocking:                
---------------------------+------------------------------------------------

Comment(by bonefish):

 Tracking this bug down will require a bit of work with kernel tracing.
 What is currently available might not even suffice, in which case we'll
 need to add some more, but let's first try with what we have (>= r34751).
 Please change the following in build/user_config_headers/tracing_config.h
 (copy from build/config_headers/):
  - ENABLE_TRACING to 1.
  - MAX_TRACE_SIZE to (100 * 1024 * 1024) or more, unless you have less
 than 256 MB of RAM; in that case try 10 or 20 MB and see whether that
 suffices.
  - SYSCALL_TRACING to 1.
  - TEAM_TRACING to 1.
  - VM_CACHE_TRACING to 2.
  - VM_CACHE_TRACING_STACK_TRACE to 10.
  - VM_PAGE_FAULT_TRACING to 1.
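
 With those changes applied, the relevant section of the copied
 tracing_config.h should look roughly like this (exact formatting of the
 file may differ):
 {{{
 #define ENABLE_TRACING                1
 #define MAX_TRACE_SIZE                (100 * 1024 * 1024)
 #define SYSCALL_TRACING               1
 #define TEAM_TRACING                  1
 #define VM_CACHE_TRACING              2
 #define VM_CACHE_TRACING_STACK_TRACE  10
 #define VM_PAGE_FAULT_TRACING         1
 }}}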

 When the panic occurs, there is quite a bit of interesting basic
 information to retrieve:
  - As always the stack trace. If it is similar to the two attached ones --
 i.e. goes through "vm_clone_area (nearest)" (which is actually
 delete_area()) -- a "call 16 -2" (16: the line number in the stack trace,
 2: the number of arguments we want to see) will yield the function's
 parameters: the address of the address space and of the area (in the case
 of gcc 4 they might be invalid). An "area address <area address>" should
 produce information on the area.
  - If the thread is a userland thread the recent syscall history of the
 thread might be interesting: "traced 0 -10 -1 and thread <thread ID> or
 #syscall #team".
  - Info on the cache: "cache <cache address>" (cache address as in the
 panic message).
  - Info on the page: "page <page address>" (page address as in the panic
 message). Since the page has mappings, there should be at least one entry
 under "area mappings:". Use "area address <area address>" with the address
 listed to get information on the area. A cache will be listed. Use "cache
 <cache address>" and "cache_tree <cache address>".
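
 With the addresses from this ticket's panic message plugged in, the start
 of such a session would look like this (the area address is whatever the
 "area mappings:" list of the page shows):
 {{{
 cache 0xd19e7680
 page 0x82868278
 area address <area address from "area mappings:">
 cache <cache address of that area>
 cache_tree <cache address of that area>
 }}}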

 That was the straightforward part. The rest is working with the tracing
 entries, which depends mainly on the information retrieved so far and on
 further information from the tracing. It's best to give a bit of
 background first: An
 area has a (VM) cache which is the top of a stack of caches (actually part
 of a tree, but let's keep it simple). A cache contains pages. A page
 represents a page of physical memory and can only live in at most one
 cache. An area can map any of the pages of the caches in its cache stack
 -- those and only those.
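
 As a rough sketch of that structure (purely illustrative):
 {{{
 area
   |
   v
 cache A   <- top cache of the area's stack
   |
   v
 cache B   <- source cache below it (e.g. file-backed)

 Each page lives in at most one cache; the area may map pages
 from cache A and cache B, and only those.
 }}}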

 The panic this ticket is about is triggered when a no longer used cache is
 destroyed although it contains a page that is still mapped by some area.
 Normally that cannot happen, since an area that is still used has
 references to its cache (indirectly to all caches in the stack) and
 therefore the cache would not be destroyed.

 There are two kinds of bugs that could trigger the panic: The cache
 reference counting is broken and the cache, although still in use, is
 destroyed. Or the page in question has been erroneously moved to another
 cache, or the complete cache itself has been moved. Since the ref
 counting is relatively simple, I'd consider the latter far more likely.
 With the information retrieved above that can be easily verified: In the
 expected case the output of the "cache_tree" for the cache of the area the
 page is still mapped by will not contain the cache that is currently being
 destroyed. IOW the area has mapped a page that is not in one of its caches
 anymore. It is relatively safe to assume that at the time when the page
 was mapped it was still in the right cache (i.e. one referenced by the
 area).

 The first step to understand the bug is to find out at which point exactly
 the page left the "legal reach" of the area, i.e. when it was removed from
 a cache in the area's stack and was inserted into an unrelated one, or
 when the cache containing the page was removed from the
 area's cache stack. In either case the approach is the same: First find
 the point at which the page was first inserted into a cache of the area,
 then trace the page and the caches it lives in until hitting the main
 point of interest. Both can be a bit tedious, since there can be quite a
 bit of fluctuation wrt. the caches of an area (fork()s and mmap()s can add
 new caches, exec*() and area deletion can cause removal of caches).

 If the area in question (still talking about the one with the page
 mapping) is a userland area, its point of creation should be easy to find,
 since there are only two syscalls that create areas:
 {{{
 traced 0 -1 -1 and and #syscall #<area ID> or #create_area #map_file
 }}}
 <area ID> must be the area ID in hexadecimal notation with leading "0x".
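As a hypothetical illustration (the ID 1234 is made up), an area ID of
1234 would be passed as #0x4d2; any shell can do the conversion:

```shell
# Convert a decimal area ID to the "0x"-prefixed hexadecimal form
# the traced filter expects (1234 is a made-up example ID).
printf '0x%x\n' 1234
```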
 Note the tracing entry index and search forward from that point looking
 for the page:
 {{{
 traced <index> 10 -1 and #page #<page address>
 }}}
 We're looking for entries saying "vm cache insert page:...". Start with
 the first entry: It needs to be verified that the printed cache belongs to
 the area at this point:
 {{{
 cache_stack area <area address> <index2>
 }}}
 If the cache isn't part of the printed list, continue with the next entry,
 until finding a cache that is (always use the index of the tracing entry
 that is examined).
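
 Putting this first phase together, a session could look like this (the
 indices and the area ID are made up; the page address is the one from the
 panic message):
 {{{
 traced 0 -1 -1 and and #syscall #0x4d2 or #create_area #map_file
 traced 10234 10 -1 and #page #0x82868278
 cache_stack area <area address> 10250
 }}}
 Here 10234 would be the index of the area creation entry and 10250 the
 index of the "vm cache insert page:..." entry currently being examined.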

 After having found the point where the page was inserted into one of the
 area's caches, both the cache and the page need to be tracked forward:
 {{{
 traced <index2> 10 -1 and "#vm cache" #<cache address>
 }}}
 This will turn up tracing entries that involve the cache. The interesting
 ones are: "vm cache remove page:..." for our page and "vm cache remove
 consumer:..." with another cache being the consumer. If the page is
 removed from the cache, find out where it is inserted next:
 {{{
 traced <index3> 10 -1 and "#vm cache" #<page address>
 }}}
 It must be verified that the new cache is in our area's stack at that
 point. If a consumer is removed from the cache, it must be checked whether
 the removed consumer is in the area's cache stack at that point (if not,
 the entry can be ignored) and if so, whether shortly after another cache
 of the area's stack is added as a consumer again. Finding out whether a
 cache is in the area's cache stack at a point works just as above, i.e.
 run
 {{{
 cache_stack area <area address> <index4>
 }}}
 and check whether the cache's address is printed. If the cache is not in
 the area's cache stack, we have found the point where things go wrong.
 Otherwise continue following the cache containing the page.
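
 Condensed, the tracking loop from above repeats these commands (the
 indices are placeholders; the page address is the one from the panic
 message):
 {{{
 traced <indexN> 10 -1 and "#vm cache" #<current cache address>
     ... on a "vm cache remove page:" entry for the page:
 traced <indexM> 10 -1 and "#vm cache" #0x82868278
 cache_stack area <area address> <indexM>
     ... until the cache holding the page is no longer listed
 }}}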

 At the point of interest (i.e. the tracing entry where the page or its
 containing cache was removed from the area's cache stack) -- let's call it
 <index5> -- a bit of context would be interesting:
 {{{
 traced --stacktrace <index5> -20
 traced --stacktrace <index5> 20
 }}}
 (will probably be longish)

 Please attach the complete KDL session.

 A general hint for capturing KDL sessions: Make sure the serial output is
 processed by a terminal, since otherwise edited prompt lines will be hard
 to read. That is, if you redirect the output into a file, "cat" the file
 in a terminal when you're done and copy the lines from the terminal (the
 terminal's scrollback history must be big enough, of course).

-- 
Ticket URL: <http://dev.haiku-os.org/ticket/5138#comment:5>
Haiku <http://dev.haiku-os.org>
Haiku - the operating system.
