
[openbeosstorage] DiskDevice API 2.x, Kernelland Draft

  • From: "Ingo Weinhold" <bonefish@xxxxxxxxxxxxxxx>
  • To: "Storage Kit" <openbeosstorage@xxxxxxxxxxxxx>
  • Date: Wed, 04 Jun 2003 00:57:55 +0200 CEST
Yo!

Lo and behold, after a long time I managed to get my ideas for the 
kernelland implementation into a presentable form, hopefully being 
relatively consistent (feel free to prove me wrong :-). There's a 
collection of headers approximating most of the relevant interfaces. 
They have some holes and are annotated with TODOs, questions and 
remarks (not that I used them differently... ;-), but should give some 
good idea of what I have in mind. To give you a bit more to read, here 
comes a short introduction and discussion of different aspects.

The headers can be found here:
  http://tfs.cs.tu-berlin.de/~bonefish/private/DiskDeviceAPI-Kernel2.0.zip

CU, Ingo


Disk Device Manager
Draft

0. Introduction

The general approach remains the same as in the current (v. 1.x) 
implementation. There is a central entity -- now the kernel, no 
longer the registrar -- which maintains an up-to-date list of disk 
devices. At any time the userland API can retrieve a snapshot of a disk 
device's structure, now no longer via message communication with the 
registrar, but more directly using syscalls.

A snapshot (BDiskDevice) retrieved from the kernel remains unchanged 
unless explicitly updated. It is only possible to update the whole disk 
device, not individual partitions (at least as made available by the 
API), so that inconsistent states of the BDiskDevice hierarchy are 
avoided.

Moreover, a complete BDiskDevice always has to be retrieved. It is not 
possible to get just a BPartition (and descendants). That should be 
possible in theory, but a design decision was made not to allow 
that, since it would probably complicate a couple of things.

So far that's not new. The intriguing new aspect is how to deal with 
making changes to the disk devices/partitions. We decided to provide 
some kind of locking mechanism (BPartition::Lock()/Unlock()) which 
enables the API user to prevent others from interfering with his/her 
modifications. All changes would be made locally only and would be 
committed to the kernel for application via 
BDiskDevice::CommitModifications().

Unfortunately keeping changes locally clashes with aspects of our 
userland API and possible module APIs. E.g. if a new partition is 
created locally, this partition doesn't have an ID, since IDs are 
assigned by the kernel. Even if we provide a syscall that returns a 
fresh ID, that ID has no meaning in the kernel -- usually IDs are mapped 
to kernel structures. So, at least for all new and modified partitions 
we would need to provide the partition data with all BDiskDevice 
requests that involve a partition. In practice we certainly wouldn't 
get around passing the complete disk device hierarchy.

The alternative approach I'm proposing is based on immediately 
submitting all changes to the kernel. Since we wanted a locking 
mechanism anyway, there will never be more than one API user at a time 
modifying a partition. For simplicity I decided that it should be 
sufficient to do locking on a per disk device basis. It will still 
allow for instance that, if a resizing job is in progress, sibling 
partitions can be modified. Now, the locking method -- I would call it 
BDiskDevice::PrepareModifications() -- causes the disk device 
representation in the kernel (KDiskDevice) and all its partitions 
(KPartition) to be cloned. Lacking a better name I called the partition 
clones shadow partitions. All modifications requested by the userland 
API are made to the shadow partitions.

When BDiskDevice::CommitModifications() (the bracketing method for 
PrepareModifications()) is invoked, the shadow hierarchy is compared 
with the original one and respective jobs (KDiskDeviceJob) are created 
and scheduled (put into a KDiskDeviceJobQueue). The shadows are deleted 
and the jobs will successively modify the original partitions. 
Partitions that are subject to change are marked busy, and ancestors of 
them descendant-busy. Although a subsequent PrepareModifications() on 
said disk device will succeed, busy/descendant-busy partitions cannot 
yet be modified. As soon as a partition is no longer marked busy, a 
shadow partition will be created and modifications to it will be 
possible again. Another CommitModifications() invocation will create 
more jobs, which are put into a new job queue being carried out in 
parallel with the current jobs.
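
To make the prepare/commit cycle a bit more tangible, here is a rough toy 
model of it. None of the names below are taken from the headers: 
`Partition', `Job' and `Device' are simplified stand-ins for KPartition, 
KDiskDeviceJob and KDiskDevice, and only offset and size changes are 
diffed. Newly created or deleted partitions are ignored entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for KPartition: only the fields needed for the diff.
struct Partition {
    int64_t offset;
    int64_t size;
    bool busy;
};

// Toy stand-in for a KDiskDeviceJob description.
struct Job {
    int32_t partitionID;
    std::string kind;   // "move" or "resize"
};

// Toy stand-in for KDiskDevice with its (optional) shadow hierarchy.
struct Device {
    std::map<int32_t, Partition> partitions;   // originals, keyed by ID
    std::map<int32_t, Partition> shadows;      // clones made by Prepare
};

// Clone every partition that is not busy into the shadow hierarchy.
void PrepareModifications(Device& device)
{
    device.shadows.clear();
    for (const auto& entry : device.partitions) {
        if (!entry.second.busy)
            device.shadows[entry.first] = entry.second;
    }
}

// Compare shadows with originals, emit one job per difference, mark the
// affected originals busy, and drop the shadows.
std::vector<Job> CommitModifications(Device& device)
{
    std::vector<Job> jobs;
    for (const auto& entry : device.shadows) {
        const int32_t id = entry.first;
        const Partition& shadow = entry.second;
        // Assumption of this toy model: every shadow has an original.
        assert(device.partitions.count(id) == 1);
        Partition& original = device.partitions[id];
        if (shadow.offset != original.offset)
            jobs.push_back(Job{id, "move"});
        if (shadow.size != original.size)
            jobs.push_back(Job{id, "resize"});
        if (shadow.offset != original.offset || shadow.size != original.size)
            original.busy = true;
    }
    device.shadows.clear();
    return jobs;
}
```

A caller would invoke PrepareModifications(), change the entries in 
`shadows', and then call CommitModifications(). A second 
PrepareModifications() skips the partitions that are still busy.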

OK, so much for the general introduction. The following sections will 
focus on the various interfaces.


1. Userland <-> Kernelland Interface

This interface is given by syscalls and the structures used by them. 
See the file `userland_interface.h'.


1.1 Identification of Entities

The following entities are referred to by IDs/tokens:

* partitions (ID of a disk device == ID of the partition it implies)
* disk systems
* jobs

The three sets of IDs are independent from each other.


1.2 Structures

The structures user_partition_data, user_disk_device_data, and 
user_partitionable_space_data are used to represent partition, disk 
device, and partitionable space data respectively. Note that these 
structures are not flat. They don't need to be, because of the way they 
are passed from kernel to userland. In theory they could directly be 
used for the internal representation of the BDiskDevice/BPartition 
data. In this case user_partition_data::user_data would become handy --
 it would contain a pointer to the BPartition object (hence no separate 
structure would be needed to represent the BPartition child 
relationship).

user_disk_device_job_info is flat. It contains the static info on a 
job. Progress and status must be retrieved explicitly.


1.3 Retrieving Data

1.3.1 Disk Devices

Disk device data can be retrieved via get_disk_device_data(). If a 
buffer large enough is supplied, the disk device manager stores the 
complete user_disk_device_data/user_partition_data tree into it. 
Otherwise it fails and returns the required buffer size in neededSize. 
The caller should allocate a buffer of that size and retry. Since 
nothing is locked to prevent the kernel structures from changing 
between the two calls, the second call might fail again. The 
game continues until it succeeds or some other error occurs. The 
alternative would be to make locking of the kernel structures available 
to the userland. I don't really like that idea, though.
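
The trial and error strategy boils down to a small loop on the caller's 
side. The following is only a sketch: the status codes, the `GetDataCall' 
signature and FetchDeviceData() are assumptions made for illustration and 
do not reflect the actual declaration of get_disk_device_data().

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical status codes; the real ones would come from kernel headers.
enum Status { kOK = 0, kBufferTooSmall = -1, kOtherError = -2 };

// Hypothetical shape of the syscall: fills the buffer if it is large
// enough, otherwise reports the required size via neededSize.
using GetDataCall = std::function<Status(void* buffer, size_t bufferSize,
    size_t* neededSize)>;

// Retry with a growing buffer until the call succeeds or fails for some
// other reason. The tree may change between two calls, so more than one
// extra round can be necessary.
Status FetchDeviceData(const GetDataCall& call, std::vector<char>& buffer)
{
    size_t neededSize = 0;
    for (;;) {
        Status status = call(buffer.data(), buffer.size(), &neededSize);
        if (status != kBufferTooSmall)
            return status;
        // Guard against a misbehaving call that would make us loop forever.
        assert(neededSize > buffer.size());
        buffer.resize(neededSize);
    }
}
```

The loop has no upper bound on the number of rounds. A real implementation 
might want to give up after a few attempts.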

Currently there's also a get_partition_data(). Maybe it should better 
be removed, for we might risk inconsistent BDiskDevice trees, if we 
aren't very careful.

get_partitionable_spaces() works according to the same trial and error 
strategy.


1.3.2 Disk Systems

find_disk_system() and get_next_disk_system() are used to find and to 
iterate through disk systems, respectively. Since the BDiskSystem 
doesn't hold more data than the ID and the name of the disk system, no 
more data is retrieved. All other BDiskSystem methods are mapped to the 
multiplex syscalls supports_partition_operation() and 
validate_partition_operation().


1.3.3 Disk Device Jobs

get_next_disk_device_job_info() provides a means to iterate through the 
active disk device jobs (BDiskDeviceRoster::BDiskDeviceRoster()), 
get_disk_device_job_info() retrieves the info for a given job ID, and 
get_disk_device_job_status() returns the current status and progress of 
a job.


1.4 Modifications

The C++ API is mapped in a straightforward way to respective syscalls. 
We have the `meta' calls prepare_disk_device_modifications(), 
commit_disk_device_modifications(), cancel_disk_device_modifications(), 
and is_disk_device_modified() (corresponding to 
BDiskDevice::CommitModifications(),...), and a syscall per modifying 
BPartition method.

initialize_partition() does perhaps need a bit more discussion, since 
there exists the planned fs_initialize_volume() function 
(<be/kernel/fs_volume.h>), which has largely intersecting functionality 
(cf. userland_interface.h for some more thoughts).


2. disk_device_manager

Besides the disk device partition hierarchies and their shadows, the 
disk device manager also maintains a list of the available disk systems 
-- live updated using node monitoring -- and the active and pending 
disk device jobs.


2.1 Disk Devices

The classes KDiskDevice and KPartition are similar to those in 
userland. They hold all associated data, i.e. those also available in 
userland, plus pointers to shadow partition and disk system, as well as 
cookies used by the partition modules/FS add-ons.

The disk device manager keeps the list of disk devices up to date using 
polling for media change checks. The other updates are retrieved by the 
kernel internal notification mechanism (the callback idea we discussed) 
-- that would cover mount points, addition/removal of devices, and 
[un]mounting of volumes.

Unlike currently implemented in the registrar, I would not maintain a 
separate list of mounted volumes, but simply a sufficiently fast volume 
ID (dev_t) to KPartition mapping. Maybe it makes sense to integrate the 
subcomponent responsible for the volume management into the disk device 
manager -- no idea what that currently looks like, though.


2.2 Disk Systems

A complete list of the available disk systems is always in memory 
(containing basically the ID, and name of the system). A disk system is 
represented by a KDiskSystem object. KDiskSystem is abstract; it 
defines only the general interface for communication with a disk 
system. The concrete subclasses KFileSystem and KPartitioningSystem 
provide the link to the FS add-ons and partition modules. They map the 
virtual functions to the FS add-on API and the partition module API, 
respectively, described below. The partition modules/FS add-ons of the 
systems in use are 
always kept loaded, the others are loaded on request. For more details 
see 3.1.


2.3 Disk Device Jobs

Disk device jobs are represented by an abstract KDiskDeviceJob class. 
Derived classes (K{Move,Resize,...}Job) provide the actual 
implementation. A KDiskDeviceJobQueue bundles a set of jobs for a disk 
device. At any time an arbitrary number of job queues can exist per 
device. The method KDiskDeviceJobQueue::Execute() spawns a new thread 
and begins to carry out the jobs in the queue one after another.

The methods KDiskDeviceJob features should be quite clear, except 
maybe ScopeID(). It returns the ID of the partition closest to the root 
of the partition tree which might be affected by this job. E.g. when 
resizing or moving a partition, that would be the parent of the 
modified partition. Every descendant of the scope partition 
(inclusively) is marked busy as long as the job exists (i.e. is 
scheduled or in progress), and all ancestors are marked descendant-
busy.
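
The marking rule for a job's scope could be sketched like this. `Node', 
MarkJobScope() and AddChild() are made-up names for illustration only; the 
real KPartition carries far more state, and the lookup from ScopeID() to 
the partition object is left out.

```cpp
#include <cassert>
#include <vector>

// Toy partition node holding only the tree links and the two flags.
struct Node {
    Node* parent = nullptr;
    std::vector<Node*> children;
    bool busy = false;
    bool descendantBusy = false;
};

// Mark the given partition and everything below it busy.
static void MarkSubtreeBusy(Node* node)
{
    node->busy = true;
    for (Node* child : node->children)
        MarkSubtreeBusy(child);
}

// Apply the marking for a job whose scope is the partition `scope':
// the scope and its descendants become busy, all ancestors become
// descendant-busy.
void MarkJobScope(Node* scope)
{
    assert(scope != nullptr);
    MarkSubtreeBusy(scope);
    for (Node* ancestor = scope->parent; ancestor != nullptr;
            ancestor = ancestor->parent) {
        ancestor->descendantBusy = true;
    }
}

// Helper for building a small tree.
void AddChild(Node& parent, Node& child)
{
    child.parent = &parent;
    parent.children.push_back(&child);
}
```

For a job resizing a logical partition, the scope would be the extended 
partition containing it: the extended partition and all its logical 
partitions are busy, the device itself is only descendant-busy, and 
unrelated primary partitions stay modifiable.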

As written in KDiskDeviceJob.h, I'm not sure if we want to add a 
Cancel() method to cancel a job in progress. Scheduled jobs can easily 
be canceled by removing them from the job queue. I just realized that 
we don't provide any methods in the userland API for canceling a job.

KDiskDeviceJobFactory is a simple factory for creating the jobs.

Oh, and there's KScanPartitionJob, which, unlike the other job classes, 
doesn't write anything to the disk device, but scans a device for 
partitions/file systems. I thought, it would be sort of fun to use the 
same interface. :-)


2.4 Disk Device Manager

The disk device manager is encapsulated in the singleton class 
KDiskDeviceManager. There are already sets of methods for disk device 
and partition, job and disk system management. Definitely missing are 
methods for the watching service it shall provide, more precisely the 
registration of hooks for kernel-internal usage. Sending messages to 
the userland would be done via the registrar, as discussed a while ago.

Furthermore the connection to/cooperation with the entity that is 
responsible for managing mounted volumes is missing, as well as 
anything concerning incoming notifications (partition mounted/unmounted, 
device added/removed,...). A service thread would listen to 
notifications and handle them accordingly.


2.5 Locking of KDiskDevices and KPartitions

Locking is a bit hairy. I considered a couple of alternatives, but 
decided that I liked them even less. For now it is intended to work 
like this:

1) One can read/write lock the structures on a per disk device basis.

2) A read lock ensures that the data of the KDiskDevice and of all its 
KPartitions remain unchanged.

3) A write lock allows modifying any data except the hierarchy. Hierarchy 
modifications additionally require locking the disk device manager (the 
BPartition methods would do that). That ensures that the disk device 
manager lock owner can traverse the disk device hierarchy.

4) KPartition implements reference counting (Register(), 
Unregister(), requiring a locked manager). An object won't be deleted 
before its reference count is 0. Introducing this mechanism was 
necessary, since otherwise there would be the chance that the 
BDiskDevice one is going to invoke {Read,Write}Lock() on is just being 
deleted. The only other way to ensure that an object is not deleted 
would be to lock the disk device manager, but in combination with 3) 
that would smell terribly of a deadlock.

5) Unless there are good reasons against this, for simplicity a shadow 
KDiskDevice will use the locking of its original KDiskDevice.
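
Point 4) might look roughly like the following in practice. This is a 
sketch under my own assumptions, not the interface from the headers: the 
`Manager' class, its method names, the `obsolete' flag and the live object 
counter are all invented here, and std::mutex stands in for whatever lock 
the kernel would use.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>

// Toy partition with a reference count. Register()/Unregister() must only
// be called with the manager lock held.
class Partition {
public:
    static int sLiveCount;   // test aid: number of objects currently alive

    Partition() { sLiveCount++; }
    ~Partition() { sLiveCount--; }

    void Register() { fReferenceCount++; }

    // Returns true if this was the last reference.
    bool Unregister()
    {
        assert(fReferenceCount > 0);
        return --fReferenceCount == 0;
    }

    int32_t CountReferences() const { return fReferenceCount; }

    bool obsolete = false;   // removed from the hierarchy, awaiting deletion

private:
    int32_t fReferenceCount = 0;
};

int Partition::sLiveCount = 0;

// Toy manager: owns the partitions and serializes registration.
class Manager {
public:
    ~Manager()
    {
        for (auto& entry : fPartitions)
            delete entry.second;
    }

    void AddPartition(int32_t id)
    {
        std::lock_guard<std::mutex> locker(fLock);
        assert(fPartitions.count(id) == 0);
        fPartitions[id] = new Partition;
    }

    // Looks up and registers in one step, so the object cannot vanish
    // between lookup and use. Returns nullptr for unknown IDs.
    Partition* RegisterPartition(int32_t id)
    {
        std::lock_guard<std::mutex> locker(fLock);
        auto it = fPartitions.find(id);
        if (it == fPartitions.end())
            return nullptr;
        it->second->Register();
        return it->second;
    }

    // Drops a reference; deletes the object if it was already removed
    // from the hierarchy and this was the last reference.
    void UnregisterPartition(Partition* partition)
    {
        std::lock_guard<std::mutex> locker(fLock);
        if (partition->Unregister() && partition->obsolete)
            delete partition;
    }

    // Removes the partition from the hierarchy. Returns true if it could
    // be deleted right away, false if deletion had to be deferred.
    bool RemovePartition(int32_t id)
    {
        std::lock_guard<std::mutex> locker(fLock);
        auto it = fPartitions.find(id);
        if (it == fPartitions.end())
            return false;
        Partition* partition = it->second;
        fPartitions.erase(it);
        if (partition->CountReferences() == 0) {
            delete partition;
            return true;
        }
        partition->obsolete = true;
        return false;
    }

private:
    std::mutex fLock;
    std::map<int32_t, Partition*> fPartitions;
};
```

The manager lock is held only for the short lookup and counting steps, so 
a caller can afterwards take the device's read or write lock without 
holding the manager lock at the same time.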


3. Partition Module/FS Add-On Interface

The interfaces for both the partition modules and the FS add-ons are 
pure C. (For the FS add-ons this was more or less a requirement, since 
the rest of their interface is C anyway.) They are defined in 
ddm_modules.h. The C disk device manager functions they can use are 
listed in disk_device_manager.h. However, in theory there should be 
nothing preventing the implementor from using the C++ interface 
(KDiskDeviceManager and friends), save not being usable from the boot 
loader maybe (see 3.5).

The hook functions making up the interfaces correspond one to one to 
KDiskSystem methods. Therefore I'll first have a closer look at these 
methods and afterwards say some words about how they are mapped to 
hooks for partition modules and FS add-ons respectively.


3.1 KDiskSystem

The interesting methods can be divided into three groups of 
functionality: scanning, querying and writing.


3.1.1 Scanning

Scanning means identification of on-disk partitions and retrieving 
information about them. This covers quite exactly the functionality 
already implemented for the disk_scanner stuff. Identify() asks a disk 
system whether it knows about the format of a given partition. It 
returns a priority, a float value between 0 and 1, indicating how well 
the disk system thinks it can deal with the partition (or a value < 0, 
if it has no clue), and a cookie that can hold arbitrary data helping 
to speed up the further process. The disk system with the highest 
priority is asked to Scan() the partition. It fills out missing data of 
the KPartition (name, type, block size, flags,...) and, in case of a 
partitioning system, adds KPartitions for the subpartitions. The cookie 
returned by Identify() is freed in Scan(); for the other disk systems 
FreeIdentifyCookie() is invoked.
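
The selection step can be sketched as follows. The `DiskSystem' struct and 
ChooseDiskSystem() are hypothetical simplifications: the partition is 
reduced to a string, and the identify cookies (and freeing them for the 
losing disk systems) are left out completely.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy disk system: identify returns a priority in [0, 1], or a value < 0
// if the system has no clue about the partition.
struct DiskSystem {
    std::string name;
    std::function<float(const std::string& partitionData)> identify;
};

// Returns the index of the disk system with the highest non-negative
// priority, or -1 if no system recognizes the partition. On a tie the
// system asked first wins.
int ChooseDiskSystem(const std::vector<DiskSystem>& systems,
    const std::string& partitionData)
{
    int best = -1;
    float bestPriority = -1.0f;
    for (size_t i = 0; i < systems.size(); i++) {
        float priority = systems[i].identify(partitionData);
        if (priority >= 0.0f && priority > bestPriority) {
            best = static_cast<int>(i);
            bestPriority = priority;
        }
    }
    return best;
}
```

The winner would then be asked to Scan() the partition.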

Currently Scan() might set a content cookie for the supplied partition 
and a cookie for each created subpartition. When freeing the KPartition 
objects (or initializing with another disk system) FreeCookie() and 
FreeContentCookie() are invoked to free those cookies. I'm not sure if 
these cookies are such a good idea. They should definitely contain no 
additional information -- all information that cannot be explicitly 
represented by KPartition is to go into the parameters and content 
parameters fields. The intention was that cookies could speed up 
requests, since data could be represented in a more convenient way.

When invoking the scanning methods, the concerned disk device must be 
write locked.


3.1.2 Querying

Querying is related to the BDiskSystem capability and validation 
requests. The hooks do not work with on-disk partitions, but only with 
in-memory representations -- thus allowing requests regarding 
partitions that do not yet exist or that are modified versions of the 
ones on disk.

There are the Supports*() and Validate*() methods also known from 
BDiskSystem plus a few more, that make sense in the kernel. 
Additionally we have CountPartitionableSpaces() and 
GetPartitionableSpaces() with the obvious meaning.

The concerned disk device must be read locked, when querying methods 
are invoked.


3.1.3 Writing

The writing methods operate on on-disk partitions. They also modify the 
in-memory representations accordingly, though. Each method is passed 
the KDiskDeviceJob that called the method. It is used to report 
progress.

The concerned disk device should not be locked, when a writing method 
is invoked. Otherwise the device would be locked the whole time the 
operation is in progress, which could be rather long. That is, in this 
case the callee is responsible for proper locking. The supplied 
KPartitions are registered, so at least they won't be deleted. It is 
not to be expected, that someone else is going to delete the object, 
anyway, since jobs are basically the only entities doing that (and 
their scopes of operation are disjoint). The only exception is when 
the medium is ejected by the user or the device is unplugged, which 
we'll hopefully handle gracefully.


3.2 Partition Modules

The KDiskSystem methods are mapped to C functions in a straightforward 
manner, i.e. for each method a function exists that takes a C 
representation of what is passed to the former ones. Thus the scanning 
and querying functions are passed partition_data structures instead of 
KPartition objects. The writing functions only get partition IDs, 
indicating the need for explicit locking, and job IDs instead of 
KDiskDeviceJob objects.

I think, only KDiskSystem::Defragment() has no counterpart, since it 
wouldn't make much sense for partitioning systems. 
get_partitionable_spaces() serves both 
{Count,Get}PartitionableSpaces().

To save some functions, the supports_*() and validate_*() functions 
could be multiplexed to supports_partition_operation() and 
validate_partition_operation(). A proposal, what the parameters for 
these functions could be, is given in supports_validates_parameters.h.
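
Just to illustrate the multiplexing idea, a sketch with made-up 
parameters. The operation codes, the `ModuleCapabilities' struct and both 
function signatures are assumptions of this example (hence the example_ 
prefix) and are not meant to match supports_validates_parameters.h. The 
rounding done in the validate function is likewise only one plausible 
behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical operation codes selecting what is being asked about.
enum PartitionOperation {
    kOperationResize,
    kOperationMove,
    kOperationSetName,
    kOperationDefragment
};

// Toy capability flags of one partitioning system.
struct ModuleCapabilities {
    bool canResize;
    bool canMove;
    bool canSetName;
};

// One multiplexed entry point instead of one hook per operation.
// Operations the module doesn't know about are reported as unsupported.
bool example_supports_partition_operation(const ModuleCapabilities& module,
    PartitionOperation operation)
{
    switch (operation) {
        case kOperationResize:
            return module.canResize;
        case kOperationMove:
            return module.canMove;
        case kOperationSetName:
            return module.canSetName;
        default:
            return false;
    }
}

// Validation in the same style. For a resize the requested size is
// rounded down to a multiple of the block size; other operations leave
// the value untouched. Returns false if the operation is unsupported.
bool example_validate_partition_operation(const ModuleCapabilities& module,
    PartitionOperation operation, int64_t* size, int64_t blockSize)
{
    assert(size != nullptr && blockSize > 0);
    if (!example_supports_partition_operation(module, operation))
        return false;
    if (operation == kOperationResize)
        *size -= *size % blockSize;
    return true;
}
```

The price for fewer hooks is a parameter list that has to fit all 
operations, which is why a separate parameter proposal is needed at all.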

Since access to the device is needed for some of the scanning and all 
of the writing functions, a device file descriptor is supplied. A file 
descriptor for the partition in question might even be better instead, 
though the device FD is available all the time anyway, since we 
need to keep the device open.


3.3 FS Add-Ons

The FS interface functions are similar to the ones of the partition 
module interface, except for a few differences: To fit better into the 
rest of the FS add-on interface, none of the functions is passed a file 
descriptor, 
but a path of the partition device instead. I also omitted the 
partition ID for the writing functions, since the path of the partition 
is as good an identifier as the ID is.


3.4 Exported Functions

disk_device_manager.h exports the C representations of the interesting 
structures and a set of C functions, which will certainly be needed 
by the partition module and FS add-on implementations. The structures 
should be clear, I think, so I directly come to the functions.

The first block consists of the functions for locking a device. The 
only thing to say about them is that the supplied ID is the ID of any 
partition on the device. This should be quite convenient. Then we have 
functions to get a disk device or partition ID for a device path. 
Moreover there are functions to traverse the partition hierarchy and 
two to create/delete partitions. partition_modified() tells the disk 
device manager that a partition's data have been modified 
(alternatively, a flag in partition_data::flags could be set, maybe). 
And last but not least, there are some job related functions.


3.5 Boot Loader

Reusing a partition module/FS add-on implementation for the boot loader 
shouldn't be too hard, I think. Only the scanning functions would need 
to be provided, and only a subset of what can be found in 
disk_device_manager.h needs to be exported by the boot loader. Since 
there's certainly only one thread of execution, locking won't be needed 
and thus {write,read}_lock_disk_device() are semantically identical 
with get_disk_device(), while the {write,read}_unlock_disk_device() are 
no-ops. The job related functions aren't of any interest either. So it 
boils down to not too big an interface.
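
In the boot loader the locking functions could then be reduced to 
something like the sketch below. The types, the signatures and the 
`sDevices' table are placeholders I made up for the example, since 
disk_device_manager.h may declare things differently. For simplicity the 
lookup here takes the device ID only, not the ID of any partition on the 
device.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

typedef int32_t partition_id;

// Placeholder for the C representation of a disk device.
struct disk_device_data {
    partition_id id;
};

// Hypothetical device table of the boot loader.
static std::map<partition_id, disk_device_data> sDevices;

// Plain lookup; returns nullptr if there is no such device.
disk_device_data* get_disk_device(partition_id id)
{
    auto it = sDevices.find(id);
    return it != sDevices.end() ? &it->second : nullptr;
}

// With a single thread of execution "locking" is just a lookup...
disk_device_data* write_lock_disk_device(partition_id id)
{
    return get_disk_device(id);
}

disk_device_data* read_lock_disk_device(partition_id id)
{
    return get_disk_device(id);
}

// ...and unlocking does nothing at all.
void write_unlock_disk_device(partition_id) {}
void read_unlock_disk_device(partition_id) {}
```

A partition module written against these functions would not need to know 
whether it runs in the kernel or in the boot loader.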





