[openbeosstorage] Re: DiskDevice API 2.x, Kernelland Draft

Here's comments on the discussion. I'll try to address the inline 
questions and comments in the headers tomorrow.

All in all, I like it. :-) Aside from the few questions I have below, I 
say go ahead and start on it. I'm willing to help if you're interested; 
I have finals next week, so may not find too much time this weekend, 
but I bet I could dig up a little. I'd also like to finish up the 
kernel utils proposal I'm working on, though; those could be useful 
along the way. 

> 0. Introduction
[...]
> The alternative approach I'm proposing is based on immediately
> submitting all changes to the kernel. Since we wanted a locking
> mechanism anyway, there will never be more than one API user at a time
> modifying a partition. For simplicity I decided that it should be
> sufficient to do locking on a per disk device basis. It will still
> allow for instance that, if a resizing job is in progress, sibling
> partitions can be modified. Now, the locking method -- I would call it
> BDiskDevice::PrepareModifications() -- causes the disk device
> representation in the kernel (KDiskDevice) and all its partition
> (KPartition) to be cloned. Lacking a better name I called the 
> partition
> clones shadow partitions. All modifications requested by the userland
> API are made to the shadow partitions.
> 
> When BDiskDevice::CommitModifications() (the bracketing method for
> PrepareModifications()) is invoked, the shadow hierarchy is compared
> with the original one and respective jobs (KDiskDeviceJob) are created
> and scheduled (put into a KDiskDeviceJobQueue). 

This seems a bit odd to me. Shouldn't the shadow hierachy just keep 
track of the appropriate job queue necessary to reach the state of the 
shadow hierarchy from the state of the original hierachy? I don't see 
how one could easily infer a proper scheduling job sequence from just 
the beginning and ending states of the hierarchy.

> 1.2 Structures
> 
> The structures user_partition_data, user_disk_device_data, and
> user_partitionable_space_data are used to represent partition, disk
> device, and partitionable space data respectively. Note that these
> structures are not flat. They don't need to be, because of the way 
> they
> are passed from kernel to userland. In theory they could directly be
> used for the internal representation of the BDiskDevice/BPartition
> data. In this case user_partition_data::user_data would become handy 
> --
>  it would contain a pointer to the BPartition object (hence no 
>  separate
> structure would be needed to represent the BPartition child
> relationship).

I like that idea. Let's do that.

> Disk device data can be retrieved via get_disk_device_data(). If a
> buffer large enough is supplied, the disk device manager stores the
> complete user_disk_device_data/user_partition_data tree into it.
> Otherwise it fails and returns the required buffer size in neededSize.
> The caller should allocate a buffer of that size and retry. Since
> there's no locking or something like that, that prevents the kernel
> structures from being changed, the second call might fail again. The
> game continues until it succeeds or some other error occurs. The
> alternative would be to make locking of the kernel structures 
> available
> to the userland. I don't really like that idea, though.

No, the game idea seems fine to me. If it turns out to really not work 
well for some reason, we can bite the bullet and add locking.

> Currently there's also a get_partition_data(). Maybe it should better
> be removed, for we might risk inconsistent BDiskDevice trees, if we
> aren't very careful.

May as well; it seems perfectly reasonable to limit the API to getting 
whole devices, and definitely makes things simpler.

> 1.4 Modifications
> 
> The C++ API is mapped in a straight forward way to respective 
> syscalls.
> We have the `meta' calls prepare_disk_device_modifications(),
> commit_disk_device_modifications(), 
> cancel_disk_device_modifications(),
> and is_disk_device_modified() (corresponding to
> BDiskDevice::CommitModifications(),...), and a syscall per modifying
> BPartition method.

I don't undertand what the shadow parameter to these functions is for.

> initialize_partition() does perhaps need a bit more discussion, since
> there exists the planned fs_initialize_volume() function (<be/kernel/
> fs_volume.h>), which has largely intersecting functionality (cf.
> userland_interface.h for some more thoughts).

My vote is for ditching fs_initialize_volume() and adding support for 
registering files as disk devices. What would the arguments be for 
keeping fs_initialize_volume() (other than the regular file problem)?

> Unlike currently implemented in the registrar, I would not maintain a
> separate list of mounted volumes, but simply a sufficiently fast 
> volume
> ID (dev_t) to KPartition mapping. Maybe it makes sense to integrate 
> the
> subcomponent responsible for the volume management into the disk 
> device
> manager -- no idea how that currently looks like, though.

That seems like a perfectly reasonable idea to me (also having no idea 
how it currently looks :-).

> 2.3 Disk Device Jobs
> 
> Disk device jobs are represented by an abstract KDiskDeviceJob class.
> Derived classes (K{Move,Resize,...}Job) provide the actual
> implementation. A KDiskDeviceJobQueue bundles a set of jobs for a disk
> device. At a time an arbitrary number of job queues can exist per
> device. The method KDiskDeviceJobQueue::Execute() spawns a new thread
> and begins to carry out the jobs in the queue one after another.
> 
> The methods KDiskDeviceJob features should be quite clear. Except
> ScopeID() maybe. It returns the ID of the partition closest to the 
> root
> of the partition tree which might be affected by this job. E.g. when
> resizing or moving a partition, that would be the parent of the
> modified partition. Every descendant of the scope partition
> (inclusively) is marked busy as long as the job exists (i.e. is
> scheduled or in progress), and all ancestors are marked descendant-
> busy.
> 
> As written in KDiskDeviceJob.h, I'm not sure if we want to add a
> Cancel() method to cancel a job in progress. Scheduled jobs can easily
> be canceled by removing them from the job queue. I just realize, that
> we don't provide any methods in the userland API for canceling a job.

Yes, I think we should support cancel, for things like defragging that 
can be interrupted. But we should also allow the job to specify if it 
allows cancelling or not. 

> Oh, and there's KScanPartitionJob, which, unlike the other job 
> classes,
> doesn't write anything to the disk device, but scans a device for
> partitions/file systems. I thought, it would be sort of fun to use the
> same interface. :-)

Interesting idea. Works for me. :-)

> 3.1.1 Scanning
> 
> Scanning means identification of on-disk partitions and retrieving
> information about them. This covers quite exactly the functionality
> already implemented for the disk_scanner stuff. Identify() asks a disk
> system whether it knows about the format of a given partition. It
> returns a priority, a float value between 0 and 1, indicating how good
> the disk system thinks it can deal with the partition (or a value < 0,
> if it has no clue), and a cookie that can hold arbitrary data helping
> to speed up the further process. The disk system with the highest
> priority is asked to Scan() the partition. It fills out missing data 
> of
> the KPartition (name, type, block size, flags,...) and, in case of a
> partitioning system, adds KPartitions for the subpartitions. The 
> cookie
> returned by Identify() is freed in Scan(); for the other disk systems
> FreeIdentifyCookie() is invoked.
> 
> Currently Scan() might set a content cookie for the supplied partition
> and a cookie for each created subpartition. When freeing the 
> KPartition
> objects (or initializing with another disk system) FreeCookie() and
> FreeContentCookie() are invoked to free those cookies. I'm not sure, 
> if
> these cookies are such a good idea. They should definitely contain no
> additional information -- all information that cannot be explicitely
> represented by KPartition are to go into the parameters and content
> parameters fields. The intention was, that cookies could speed up
> requests, for data could be represented in a more convenient way.

I say leave them in for now and let's see how useful they are in the 
actual implementation. If we leave them in, we just make it clear what 
they're to be used for.

> To save some functions, the supports_() and validate_*() functions
> could be multiplexed to supports_partition_operation() and
> validate_partition_operation(). A proposal, what the parameters for
> these functions could be, is given in supports_validates_parameters.h.

I don't like the idea of multiplexing here, mostly because everyone who 
writes a module will have to deal with parameter interpretation. If we 
have separate functions, we write interpretation code that's run after 
the syscalls enter kernelland, and module implementors simply implement 
the necessary functions.

> 3.3 FS Add-Ons
> 
> The FS interface functions are similar to the ones of partition module
> interface, except for a few differences: To fit better into the rest 
> of
> the FS add-on interface, none of functions is passed a file 
> descriptor,
> but a path of the partition device instead. 

Wouldn't it be better to just open the proper device once and pass it 
around? This way you could also ensure the device is only accessed 
read-only during scanning. 

Phew. Time for bed. :-)

-Tyler

Other related posts: