[codeface] Re: Re-analyze a project

  • From: Mitchell Joblin <joblin.m@xxxxxxxxx>
  • To: codeface@xxxxxxxxxxxxx
  • Date: Wed, 1 Jul 2015 15:19:47 +0000

Hi Claus,

On Wed, Jul 1, 2015 at 2:35 PM, Claus Hunsen <hunsen@xxxxxxxxxxxxxxxxx> wrote:

Hi everybody,

I have two questions regarding the re-analysis of a project in Codeface.
Consider the situation where I have already done a complete run of a
project (e.g., with the tagging configured as "proximity"): all commits
of the given release ranges have been filled into the database, and
afterwards the developer-network analysis has been performed and
written to the results folder.


(1) How can I re-run the network-analysis part of Codeface based on the
database already filled by the commit analysis? Is there any way to do
this?
To rephrase the question: Can I run the different parts of Codeface
(blame/commit analysis and network analysis) independently?

The rationale is simple: I could have fixed a bug in the
network-analysis part of Codeface and want to do the analysis again, but
I do not want to re-fill the database (the 'commit' and
'commit_dependency' tables). This would obviously save a lot of time.

Yes, that is a perfectly reasonable use case that I also encounter
frequently. The first phase of Codeface (the data extraction) is
incredibly time-consuming compared to the subsequent phases. There are
at least two options to achieve what you want.

One option is the "create_db" option in the cluster.py interface. I
don't recall whether you can configure that from the configuration
files; worst case, you need to edit the code. When create_db is set to
false, the first phase of data extraction from the version control
system is not performed. Before data is written to the database, we
create a serialized VCS object and write it to disk. So the mysql
database will still be refilled, but this should be much faster than
performing all the data extraction again.
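
To illustrate the idea, here is a minimal sketch of that control flow.
The names extract_from_git() and write_to_database() are placeholders
of mine, not the actual code in cluster.py:

    import os
    import pickle

    def extract_from_git(repo_dir):
        # Placeholder for the expensive VCS extraction phase.
        return {"repo": repo_dir, "commits": []}

    def write_to_database(vcs):
        # Placeholder for refilling the mysql tables from the VCS object.
        print("refilling database for", vcs["repo"])

    def analyse(repo_dir, dump_path, create_db=True):
        if create_db or not os.path.exists(dump_path):
            vcs = extract_from_git(repo_dir)   # slow: walks the whole repository
            with open(dump_path, "wb") as f:
                pickle.dump(vcs, f)            # cache the result for later runs
        else:
            with open(dump_path, "rb") as f:
                vcs = pickle.load(f)           # fast path: reuse the cached object
        write_to_database(vcs)                 # the database is refilled either way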

The second option is to just call the R script manually. The R script
"persons.r" is the main entry point for the graph analysis, and you can
have a look at how Python calls persons.r in project.py. It basically
passes some config files, nothing too tricky. This will be really fast
because the R scripts just query the database, so it won't need to be
re-filled; the trade-off is a little more manual effort in crafting the
call to the R script.
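
Very roughly, such a manual call could be driven from a few lines of
Python like the ones below. Please take the exact argument list from
the call in project.py; the paths and the argument order here are only
assumptions for illustration:

    import subprocess

    cmd = [
        "Rscript", "cluster/persons.r",
        "/path/to/resdir/project/range_01",  # results directory for one range
        "codeface.conf",                     # global configuration file
        "conf/project.conf",                 # project configuration file
    ]
    subprocess.check_call(cmd)               # runs the graph analysis for that range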




(2) Regarding the blame/commit analysis of Codeface (in particular, the
'commit' and 'commit_dependency' tables in the database): Is it possible
to run several kinds of tagging configurations at once, or to add
another one in a second run of Codeface?

At the moment, I write, e.g., "tagging: proximity" in the configuration
file; thus, I end up with several projects in the Codeface database
(e.g., "project_feature" and "project_file") for each repository that I
analyze, each with an independent set of the same commits in the
'commit' table!
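
For reference, the relevant part of my project configuration looks
roughly like this (simplified and from memory, so the exact set of
fields may differ; only the tagging line matters here):

    project: myproject
    repo: myproject
    description: Example project
    revisions: ["v1.0", "v1.1", "v1.2"]
    tagging: proximity   # one tagging method per run, e.g. proximity, feature, file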

I can think of two scenarios that apply to the current situation:
- adding another commit analysis to the existing one inside the
database, so that both share their 'commit'-table entries (NO
independent sets anymore!).
- running several commit/blame analyses at once, e.g., by supplying
"tagging: [proximity, feature, file]" inside the configuration file.
Can we achieve this somehow, and, if so, how?

Right, so we designed things with the intent of using only one tagging
method. It is annoying, wasteful, and ugly when one wants to run
multiple tagging types for one project. I agree that, now that we have
so many different tagging types, they are no longer mutually exclusive
as they once were. I guess there would be several things we would need
to change; for example, the database table for a project also contains
the tagging method.

What's the main motivation for this? Is it to reduce the runtime of
performing multiple analyses, or just to make it easier to perform
multiple analyses regardless of runtime? Also, is part of the issue
that the database schema separates different tagging types into
different project entries?

Thanks,

Mitchell



I hope you understand my thoughts and can give me a hint on how to deal
with the problems that arise from having several distinct projects in
the database.

Best regards,
Claus






