
[openbeosstorage] DiskDevice API 2.x, Kernelland Draft
- From: "Ingo Weinhold" <bonefish@xxxxxxxxxxxxxxx>
- To: "Storage Kit" <openbeosstorage@xxxxxxxxxxxxx>
- Date: Wed, 04 Jun 2003 00:57:55 +0200 CEST
Yo!
Lo and behold, after a long time I managed to get my ideas for the
kernelland implementation into a presentable form, hopefully being
relatively consistent (feel free to prove me wrong :-). There's a
collection of headers approximating most of the relevant interfaces.
They have some holes and are annotated with TODOs, questions and
remarks (not that I used them differently... ;-), but should give some
good idea of what I have in mind. To give you a bit more to read, here
comes a short introduction and discussion of different aspects.
The headers can be found here:
http://tfs.cs.tu-berlin.de/~bonefish/private/DiskDeviceAPI-Kernel2.0.zip
CU, Ingo
Disk Device Manager
Draft
0. Introduction
The general approach remains the same as in the current (v. 1.x)
implementation. There is a central entity -- now the kernel, no
longer the registrar -- which maintains an up-to-date list of disk
devices. At any time the userland API can retrieve a snapshot of a disk
device's structure, now no longer via message communication with the
registrar, but more directly using syscalls.
A snapshot (BDiskDevice) retrieved from the kernel remains unchanged
unless explicitly updated. It is only possible to update the whole disk
device, not individual partitions (at least as made available by the
API), so that inconsistent states of the BDiskDevice hierarchy are
avoided.
Moreover, a complete BDiskDevice always has to be retrieved. It is not
possible to get just a BPartition (and descendants). That should be
possible in theory, but a design decision was made not to allow it,
since it would probably complicate a couple of things.
So far that's not new. The intriguing new aspect is how to deal with
making changes to the disk devices/partitions. We decided to provide
some kind of locking mechanism (BPartition::Lock()/Unlock()) which
enables the API user to prevent others from interfering with his/her
modifications. All changes would be made locally only and would be
committed to the kernel for application via
BDiskDevice::CommitModifications().
Unfortunately keeping changes locally clashes with aspects of our
userland API and possible module APIs. E.g. if a new partition is
created locally, this partition doesn't have an ID, since IDs are
assigned by the kernel. Even if we provide a syscall that returns a
fresh ID, that ID has no meaning in the kernel -- usually IDs are mapped
to kernel structures. So, at least for all new and modified partitions
we would need to provide the partition data with all BDiskDevice
requests that involve a partition. In practice we certainly wouldn't
get around passing the complete disk device hierarchy.
The alternative approach I'm proposing is based on immediately
submitting all changes to the kernel. Since we wanted a locking
mechanism anyway, there will never be more than one API user at a time
modifying a partition. For simplicity I decided that it should be
sufficient to do locking on a per disk device basis. It will still
allow, for instance, sibling partitions to be modified while a resizing
job is in progress. Now, the locking method -- I would call it
BDiskDevice::PrepareModifications() -- causes the disk device
representation in the kernel (KDiskDevice) and all its partitions
(KPartition) to be cloned. Lacking a better name I called the partition
clones shadow partitions. All modifications requested by the userland
API are made to the shadow partitions.
When BDiskDevice::CommitModifications() (the bracketing method for
PrepareModifications()) is invoked, the shadow hierarchy is compared
with the original one and respective jobs (KDiskDeviceJob) are created
and scheduled (put into a KDiskDeviceJobQueue). The shadows are deleted
and the jobs will successively modify the original partitions.
Partitions that are subject to change are marked busy, and ancestors of
them descendant-busy. Although a subsequent PrepareModifications() on
said disk device will succeed, busy/descendant-busy partitions cannot
yet be modified. As soon as a partition is no longer marked busy, a
shadow partition will be created and modifications to it will be
possible again. Another CommitModifications() invocation will create
more jobs, which are put into a new job queue being carried out in
parallel with the current jobs.
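To make the prepare/commit cycle a bit more tangible, here is a minimal sketch in C++. It is not taken from the headers: Partition, Job, Device and the two free functions are simplified stand-ins invented for illustration, the hierarchy is flattened into a map, and only size and offset changes are diffed.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-in for KPartition.
struct Partition {
    int id;
    long long offset;
    long long size;
};

// Hypothetical stand-in for KDiskDeviceJob: it only records what to do.
struct Job {
    std::string type;   // "resize" or "move"
    int partitionID;
};

// A device is modelled as a flat map of partitions keyed by ID.
using Device = std::map<int, Partition>;

// Prepare: clone the original partitions into shadow partitions.
Device PrepareModifications(const Device& original)
{
    return original;    // a full copy, since Partition is a plain value type
}

// Commit: compare the shadows with the originals and create one job per
// detected difference. Created/deleted partitions are not covered here.
std::vector<Job> CommitModifications(const Device& original,
    const Device& shadow)
{
    std::vector<Job> jobs;
    for (const auto& entry : shadow) {
        const Partition& changed = entry.second;
        assert(changed.id == entry.first);
        auto found = original.find(entry.first);
        if (found == original.end())
            continue;
        const Partition& current = found->second;
        if (current.size != changed.size)
            jobs.push_back(Job{"resize", changed.id});
        if (current.offset != changed.offset)
            jobs.push_back(Job{"move", changed.id});
    }
    return jobs;
}
```

Note that the originals stay untouched until the jobs run; all edits go to the copy returned by PrepareModifications().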
OK, so much for the general introduction. The following sections will
focus on the various interfaces.
1. Userland <-> Kernelland Interface
This interface is given by syscalls and the structures used by them.
Confer the file `userland_interface.h'.
1.1 Identification of Entities
The following entities are referred to by IDs/tokens:
* partitions (ID of a disk device == ID of the partition it implies)
* disk systems
* jobs
The three sets of IDs are independent from each other.
1.2 Structures
The structures user_partition_data, user_disk_device_data, and
user_partitionable_space_data are used to represent partition, disk
device, and partitionable space data respectively. Note that these
structures are not flat. They don't need to be, because of the way they
are passed from kernel to userland. In theory they could directly be
used for the internal representation of the BDiskDevice/BPartition
data. In this case user_partition_data::user_data would come in handy --
it would contain a pointer to the BPartition object (hence no separate
structure would be needed to represent the BPartition child
relationship).
user_disk_device_job_info is flat. It contains the static info on a
job. Progress and status must be retrieved explicitly.
1.3 Retrieving Data
1.3.1 Disk Devices
Disk device data can be retrieved via get_disk_device_data(). If a
buffer large enough is supplied, the disk device manager stores the
complete user_disk_device_data/user_partition_data tree into it.
Otherwise it fails and returns the required buffer size in neededSize.
The caller should allocate a buffer of that size and retry. Since
there's no locking or anything similar that prevents the kernel
structures from being changed, the second call might fail again. The
game continues until it succeeds or some other error occurs. The
alternative would be to make locking of the kernel structures available
to the userland. I don't really like that idea, though.
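The trial-and-error strategy could look roughly like the following C++ sketch. The actual signature of get_disk_device_data() is not reproduced here; FetchFunction, the Status codes and FetchDeviceData() are assumptions made for illustration, and the attempt limit is an addition of this sketch (the text above simply retries until success or another error).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical status codes; the real syscall would return a status_t.
enum Status { kOK = 0, kBufferTooSmall = 1, kError = 2 };

// Assumed shape of the syscall: fills `buffer` if `size` suffices,
// otherwise reports the required size in `neededSize`.
using FetchFunction
    = std::function<Status(void* buffer, std::size_t size,
        std::size_t* neededSize)>;

// Retry with a growing buffer until the call succeeds, a genuine error
// occurs, or the (invented) attempt limit is reached.
Status FetchDeviceData(const FetchFunction& fetch, std::vector<char>& buffer,
    int maxAttempts = 8)
{
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
        std::size_t neededSize = 0;
        Status status = fetch(buffer.data(), buffer.size(), &neededSize);
        if (status != kBufferTooSmall)
            return status;              // success or a genuine error
        assert(neededSize > buffer.size());
        buffer.resize(neededSize);      // grow and try again
    }
    return kError;  // the kernel structures kept changing; give up
}
```

Since the kernel structures may grow between two calls, the loop has to be prepared to resize the buffer more than once.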
Currently there's also a get_partition_data(). Maybe it should better
be removed, for we might risk inconsistent BDiskDevice trees, if we
aren't very careful.
get_partitionable_spaces() works according to the same trial and error
strategy.
1.3.2 Disk Systems
find_disk_system() and get_next_disk_system() are used to find and to
iterate through disk systems, respectively. Since the BDiskSystem
doesn't hold more data than the ID and the name of the disk system, no
more data is retrieved. All other BDiskSystem methods are mapped to the
multiplex syscalls supports_partition_operation() and
validate_partition_operation().
1.3.3 Disk Device Jobs
get_next_disk_device_job_info() provides a means to iterate through the
active disk device jobs (BDiskDeviceRoster::BDiskDeviceRoster()),
get_disk_device_job_info() retrieves the info for a given job ID, and
get_disk_device_job_status() returns the current status and progress of
a job.
1.4 Modifications
The C++ API is mapped in a straightforward way to respective syscalls.
We have the `meta' calls prepare_disk_device_modifications(),
commit_disk_device_modifications(), cancel_disk_device_modifications(),
and is_disk_device_modified() (corresponding to
BDiskDevice::CommitModifications(),...), and a syscall per modifying
BPartition method.
initialize_partition() perhaps needs a bit more discussion, since
there exists the planned fs_initialize_volume() function
(<be/kernel/fs_volume.h>), which has largely intersecting functionality
(cf. userland_interface.h for some more thoughts).
2. disk_device_manager
Besides the disk device partition hierarchies and their shadows, the
disk device manager also maintains a list of the available disk systems
-- live updated using node monitoring -- and the active and pending
disk device jobs.
2.1 Disk Devices
The classes KDiskDevice and KPartition are similar to those in
userland. They hold all associated data, i.e. those also available in
userland, plus pointers to shadow partition and disk system, as well as
cookies used by the partition modules/FS add-ons.
The disk device manager keeps the list of disk devices up to date using
polling for media change checks. The other updates are retrieved by the
kernel internal notification mechanism (the callback idea we discussed)
-- that would cover mount points, addition/removal of devices, and
[un]mounting of volumes.
Unlike currently implemented in the registrar, I would not maintain a
separate list of mounted volumes, but simply a sufficiently fast volume
ID (dev_t) to KPartition mapping. Maybe it makes sense to integrate the
subcomponent responsible for the volume management into the disk device
manager -- no idea what that currently looks like, though.
2.2 Disk Systems
A complete list of the available disk systems is always in memory
(containing basically the ID and name of the system). A disk system is
represented by a KDiskSystem object. KDiskSystem is abstract; it
defines only the general interface for communication with a disk
system. The concrete subclasses KFileSystem and KPartitioningSystem
provide the link to the FS add-ons and partition modules. They map the
virtual functions to the FS API and the module API, respectively, as
described below. The partition modules/FS add-ons of the systems in use are
always kept loaded, the others are loaded on request. For more details
see 3.1.
2.3 Disk Device Jobs
Disk device jobs are represented by an abstract KDiskDeviceJob class.
Derived classes (K{Move,Resize,...}Job) provide the actual
implementation. A KDiskDeviceJobQueue bundles a set of jobs for a disk
device. At any time an arbitrary number of job queues can exist per
device. The method KDiskDeviceJobQueue::Execute() spawns a new thread
and begins to carry out the jobs in the queue one after another.
The methods KDiskDeviceJob features should be quite clear, except maybe
ScopeID(). It returns the ID of the partition closest to the root
of the partition tree which might be affected by this job. E.g. when
resizing or moving a partition, that would be the parent of the
modified partition. Every descendant of the scope partition
(inclusively) is marked busy as long as the job exists (i.e. is
scheduled or in progress), and all ancestors are marked descendant-
busy.
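The busy/descendant-busy marking for a job's scope can be sketched as follows. PartitionNode, MarkBusy() and MarkJobScope() are hypothetical names for this illustration only; the real KPartition certainly looks different.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified partition node; not the real KPartition.
struct PartitionNode {
    PartitionNode* parent = nullptr;
    std::vector<PartitionNode*> children;
    bool busy = false;
    bool descendantBusy = false;
};

// Mark a partition and all of its descendants busy.
void MarkBusy(PartitionNode* node)
{
    node->busy = true;
    for (PartitionNode* child : node->children)
        MarkBusy(child);
}

// Apply the marking for a job whose ScopeID() resolves to `scope`:
// the scope subtree becomes busy, all ancestors descendant-busy.
void MarkJobScope(PartitionNode* scope)
{
    assert(scope != nullptr);
    MarkBusy(scope);
    for (PartitionNode* ancestor = scope->parent; ancestor != nullptr;
            ancestor = ancestor->parent) {
        ancestor->descendantBusy = true;
    }
}
```

For a resize of a logical partition, the scope would be its parent, so the siblings of the resized partition are marked busy as well, while partitions in other subtrees of the device remain modifiable.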
As written in KDiskDeviceJob.h, I'm not sure if we want to add a
Cancel() method to cancel a job in progress. Scheduled jobs can easily
be canceled by removing them from the job queue. I just realized that
we don't provide any methods in the userland API for canceling a job.
KDiskDeviceJobFactory is a simple factory for creating the jobs.
Oh, and there's KScanPartitionJob, which, unlike the other job classes,
doesn't write anything to the disk device, but scans a device for
partitions/file systems. I thought, it would be sort of fun to use the
same interface. :-)
2.4 Disk Device Manager
The disk device manager is encapsulated in the singleton class
KDiskDeviceManager. There are already sets of methods for disk device
and partition, job and disk system management. Definitely missing are
methods for the watching service it shall provide. More precisely the
registration of hooks for kernel internal usage. Sending messages to
the userland would be done via the registrar, as discussed a while ago.
Furthermore the connection to/cooperation with the entity that is
responsible for managing mounted volumes is missing, as well as
anything concerning incoming notifications (partition mounted/unmounted,
device added/removed,...). A service thread would listen to
notifications and handle them accordingly.
2.5 Locking of KDiskDevices and KPartitions
Locking is a bit hairy. I considered a couple of alternatives, that, I
decided, I liked even less. However, now it is intended to work like
this:
1) One can read/write lock the structures on a per disk device basis.
2) A read lock ensures that the data of the KDiskDevice and of all its
KPartitions remain unchanged.
3) A write lock allows modifying any data except the hierarchy. Hierarchy
modifications additionally require locking the disk device manager (the
BPartition methods would do that). That ensures, that the disk device
manager lock owner can traverse the disk device hierarchy.
4) KPartition implements reference counting (Register(),
Unregister(), requiring a locked manager). An object won't be deleted
before its reference count is 0. Introducing this mechanism was
necessary, since otherwise there would be the chance that the
BDiskDevice one is going to invoke {Read,Write}Lock() on is just being
deleted. The only other way to ensure that an object is not deleted
would be to lock the disk device manager, but in combination with 3)
that would smell terribly of a deadlock.
5) Unless there are good reasons against this, for simplicity a shadow
KDiskDevice will use the locking of its original KDiskDevice.
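The point of 4) is that the manager lock only needs to be held briefly while the object is pinned. A rough C++ sketch of that idea follows; Manager, CountedPartition, MarkObsolete(), CanBeDeleted(), Pin() and Unpin() are all invented for illustration, and only Register()/Unregister() correspond to names mentioned above.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for KDiskDeviceManager; only its lock matters here.
struct Manager {
    std::mutex lock;
};

// Hypothetical stand-in for the reference counting part of KPartition.
class CountedPartition {
public:
    // Both methods expect the caller to hold the manager lock.
    void Register() { fReferenceCount++; }
    void Unregister()
    {
        assert(fReferenceCount > 0);
        fReferenceCount--;
    }

    // Invented helper: records that the partition has left the hierarchy
    // and should be deleted as soon as nobody references it anymore.
    void MarkObsolete() { fObsolete = true; }
    bool CanBeDeleted() const { return fObsolete && fReferenceCount == 0; }

private:
    int fReferenceCount = 0;
    bool fObsolete = false;
};

// Pin the object under the manager lock. The lock is released on return,
// so the caller can afterwards lock the device without holding it.
void Pin(Manager& manager, CountedPartition& partition)
{
    std::lock_guard<std::mutex> guard(manager.lock);
    partition.Register();
}

void Unpin(Manager& manager, CountedPartition& partition)
{
    std::lock_guard<std::mutex> guard(manager.lock);
    partition.Unregister();
}
```

Because the manager lock is never held while waiting for a device lock, the lock order problem hinted at in 4) does not arise in this sketch.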
3. Partition Module/FS Add-On Interface
The interfaces for both the partition modules and the FS add-ons are
pure C. (For the FS add-ons this was more or less a requirement, since
the rest of their interface is C anyway.) They are defined in
ddm_modules.h. The C disk device manager functions they can use are
listed in disk_device_manager.h. However, in theory there should be
nothing preventing the implementor from using the C++ interface
(KDiskDeviceManager and friends), save not being usable from the boot
loader maybe (see 3.5).
The hook functions making up the interfaces correspond one to one to
KDiskSystem methods. Therefore I'll first have a closer look at these
methods and afterwards say some words about how they are mapped to
hooks for partition modules and FS add-ons respectively.
3.1 KDiskSystem
The interesting methods can be divided into three groups of
functionality: scanning, querying and writing.
3.1.1 Scanning
Scanning means identification of on-disk partitions and retrieving
information about them. This covers quite exactly the functionality
already implemented for the disk_scanner stuff. Identify() asks a disk
system whether it knows about the format of a given partition. It
returns a priority, a float value between 0 and 1, indicating how well
the disk system thinks it can deal with the partition (or a value < 0,
if it has no clue), and a cookie that can hold arbitrary data helping
to speed up the further process. The disk system with the highest
priority is asked to Scan() the partition. It fills out missing data of
the KPartition (name, type, block size, flags,...) and, in case of a
partitioning system, adds KPartitions for the subpartitions. The cookie
returned by Identify() is freed in Scan(); for the other disk systems
FreeIdentifyCookie() is invoked.
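The selection process described above might be sketched like this. DiskSystem here is a hypothetical record of what each system's Identify() returned, not the KDiskSystem class, and the sketch assumes that a system answering with a negative priority has no cookie that needs freeing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one disk system's Identify() result.
struct DiskSystem {
    std::string name;
    float priority;     // value returned by Identify(); < 0 means "no clue"
    bool cookieFreed;   // set when FreeIdentifyCookie() would be invoked
    bool scanned;       // set when Scan() would be invoked
};

// Let the system with the highest priority scan the partition and free
// the identify cookies of the others. Returns the index of the winning
// system, or -1 if no system recognized the partition.
int IdentifyAndScan(std::vector<DiskSystem>& systems)
{
    int best = -1;
    for (std::size_t i = 0; i < systems.size(); i++) {
        assert(systems[i].priority <= 1.0f);
        if (systems[i].priority < 0)
            continue;
        if (best < 0 || systems[i].priority > systems[best].priority)
            best = static_cast<int>(i);
    }

    for (std::size_t i = 0; i < systems.size(); i++) {
        if (static_cast<int>(i) == best)
            systems[i].scanned = true;      // Scan() frees its own cookie
        else if (systems[i].priority >= 0)
            systems[i].cookieFreed = true;  // FreeIdentifyCookie()
    }
    return best;
}
```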
Currently Scan() might set a content cookie for the supplied partition
and a cookie for each created subpartition. When freeing the KPartition
objects (or initializing with another disk system) FreeCookie() and
FreeContentCookie() are invoked to free those cookies. I'm not sure if
these cookies are such a good idea. They should definitely contain no
additional information -- all information that cannot be explicitly
represented by KPartition is to go into the parameters and content
parameters fields. The intention was, that cookies could speed up
requests, for data could be represented in a more convenient way.
When invoking the scanning methods, the concerned disk device must be
write locked.
3.1.2 Querying
Querying is related to the BDiskSystem capability and validation
requests. The hooks do not work with on-disk partitions, but only with
in-memory representations -- thus allowing requests regarding
partitions that do not yet exist or that are modified versions of the ones
on-disk.
There are the Supports*() and Validate*() methods also known from
BDiskSystem plus a few more, that make sense in the kernel.
Additionally we have CountPartitionableSpaces() and
GetPartitionableSpaces() with the obvious meaning.
The concerned disk device must be read locked, when querying methods
are invoked.
3.1.3 Writing
The writing methods operate on on-disk partitions. They also modify the
in-memory representations accordingly, though. Each method is passed
the KDiskDeviceJob, which called the method. It is used to report
progress.
The concerned disk device should not be locked, when a writing method
is invoked. Otherwise the device would be locked for the whole time the
operation is in progress, which could be rather long. That is, in this
case the callee is responsible for proper locking. The supplied
KPartitions are registered, so at least they won't be deleted. It is
not to be expected, that someone else is going to delete the object,
anyway, since jobs are basically the only entities doing that (and
their scopes of operation are disjoint). The only exception is when
the medium is ejected by the user or the device is unplugged, which
we'll hopefully handle gracefully.
3.2 Partition Modules
The KDiskSystem methods are mapped to C functions in a straightforward
manner, i.e. for each method a function exists that takes a C
representation of what is passed to the former ones. Thus the scanning
and querying functions are passed partition_data structures instead of
KPartition objects. The writing functions only get partition IDs,
indicating the need for explicit locking, and job IDs instead of
KDiskDeviceJob objects.
I think, only KDiskSystem::Defragment() has no counterpart, since it
wouldn't make much sense for partitioning systems.
get_partitionable_spaces() serves both
{Count,Get}PartitionableSpaces().
To save some functions, the supports_*() and validate_*() functions
could be multiplexed to supports_partition_operation() and
validate_partition_operation(). A proposal, what the parameters for
these functions could be, is given in supports_validates_parameters.h.
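Just to illustrate what multiplexing buys, here is a small C++ sketch. It does not reflect the contents of supports_validates_parameters.h; the operation codes, ModuleCapabilities and SupportsPartitionOperation() are assumptions, and the actual module interface would be plain C.

```cpp
#include <cassert>

// Hypothetical operation codes; the real proposal may look quite different.
enum PartitionOperation {
    kOperationResize = 0,
    kOperationMove,
    kOperationInitialize,
    kOperationCount
};

// Hypothetical per-module capability table, indexed by operation code.
struct ModuleCapabilities {
    bool supported[kOperationCount];
};

// One multiplexed entry point instead of one supports_*() hook per
// operation: the operation code selects what is being asked.
bool SupportsPartitionOperation(const ModuleCapabilities& module,
    PartitionOperation operation)
{
    assert(operation >= 0 && operation < kOperationCount);
    return module.supported[operation];
}
```

Adding a new operation then means adding an operation code rather than a new hook function to every module.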
Since access to the device is needed for some of the scanning and all
of the writing functions, a device file descriptor is supplied. A file
descriptor for the partition in question might even be better, though
the device FD is available all the time anyway, since we need to keep
the device open.
3.3 FS Add-Ons
The FS interface functions are similar to the ones of the partition module
interface, except for a few differences: To fit better into the rest of
the FS add-on interface, none of the functions is passed a file descriptor,
but a path of the partition device instead. I also omitted the
partition ID for the writing functions, since the path of the partition
is as good an identifier as the ID is.
3.4 Exported Functions
disk_device_manager.h exports the C representations of the interesting
structures and a set of C functions, which will certainly be needed
by the partition module and FS add-on implementations. The structures
should be clear, I think, so I'll go directly to the functions.
The first block consists of the functions for locking a device. The only
thing to say about them is that the supplied ID is the ID of any partition
on the device. This should be quite convenient. Then we have functions
to get a disk device or partition ID for a device path. Moreover there
are functions to traverse the partition hierarchy and two to create/
delete partitions. partition_modified() tells the disk device manager,
that a partition's data have been modified (instead a flag in
partition_data::flags could be set, maybe). And last but not least,
there are some job related functions.
3.5 Boot Loader
Reusing a partition module/FS add-on implementation for the boot loader
shouldn't be too hard, I think. Only the scanning functions would need
to be provided, and only a subset of what can be found in
disk_device_manager.h needs to be exported by the boot loader. Since
there's certainly only one thread of execution, locking won't be needed
and thus {write,read}_lock_disk_device() are semantically identical
with get_disk_device(), while the {write,read}_unlock_disk_device() are
no-ops. The job related functions aren't of any interest either. So it
boils down to not too big an interface.