[codeface] Re: [PATCH 3/3] Add global email analysis to database

  • From: Andreas Ringlstetter <andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx>
  • To: <codeface@xxxxxxxxxxxxx>
  • Date: Mon, 26 Oct 2015 15:38:31 +0100



Am 26.10.2015 um 15:19 schrieb Mitchell Joblin:

On Mon, Oct 26, 2015 at 2:56 PM, Andreas Ringlstetter
<andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx> wrote:


Am 26.10.2015 um 14:15 schrieb Mitchell Joblin:
On Mon, Oct 26, 2015 at 12:43 PM, Andreas Ringlstetter
<andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx> wrote:
Good Morning,

Am 26.10.2015 um 11:54 schrieb Mitchell Joblin:
+
+-- -----------------------------------------------------
+-- Table `codeface`.`mail`
+-- -----------------------------------------------------
+DROP TABLE IF EXISTS `codeface`.`mail` ;
+
+CREATE TABLE IF NOT EXISTS `codeface`.`mail` (
+ `id` BIGINT NOT NULL AUTO_INCREMENT,
+ `projectId` BIGINT NOT NULL,
+ `emailId` VARCHAR(255) NOT NULL,

Why are you using a varchar for an ID?
Or is this supposed to be the rfc2111 "Message-ID"?
Why "NOT NULL" in the latter case? It's optional.

This is an id created during the corpus generation, its not the
Message-ID. I originally thought we needed to maintain this for some
local operations on the corpus but I think we won't have to store it.

Ah, the local message ID generated by snatm? I don't think that can be
reused in any way.


Also questioning the length,
https://tools.ietf.org/html/rfc4130#section-5.3.3 states that the ID
should be less than 255 characters, but that's not enforced. Neither the
code nor the database scheme contains any safeguard for this case.

And if it actually is the "Message-ID", why is the corresponding
"Reply-To" field discarded? (Which should btw. be resolved prior to
storing it in the database.) Without it, the thread structure can't be
reconstructed.

Also, if it's the Message-ID, is it the mangled form without (optional)
brackets or the raw form?

+ `threadId` BIGINT NOT NULL,
+ `mlId` BIGINT NOT NULL,
+ `author` BIGINT NOT NULL,
+ `subject` VARCHAR(255) NULL DEFAULT NULL,

This will break as soon as a thread contains only a single subject line
longer than expected.
Is it even necessary to record the subject? It shouldn't differ to much
from the parent topic?
And why allowing NULL? It's a mandatory field AFAIK.

This is currently how we handle the subject line in the mail_thread
table. If we want to handle the subject in a different way then please
provide a patch for the mail_thread table as well so that we can
handle it uniformly. Maybe Wolfgang can provide the rationalization
for using VARCHAR(255) for the subject. So far we have had no issues.


+ `creationDate` DATETIME NOT NULL,
+ PRIMARY KEY (`id`),
+ INDEX `mail_author_idx` (`author` ASC),
+ INDEX `mail_projectId_idx` (`projectId` ASC),
+ INDEX `mail_mlId_idx` (`mlId` ASC),

Please do me a favor and don't only add the keys required for contraints
to work, but also composite keys covering attribute combinations
commonly used in queries. This table is going to have a lot of data in
it, and you don't want to cause a full table scan, less so on possible
joins.

Sorry but I don't understand the suggestion. Do you have an example,
or a table which already does this?

Non of the existing tables it using it, which would explain the bad
performance when dealing with complex joins or larger tables.

If you have expensive queries which are querying for multiple constant
field values in a single table, add a composite index covering precisely
these fields. Especially true when such an table is used as a cross
table in a multiple table join.

MySQL will be happy to run your query solely on indexes if it can,
avoiding table or range scans entirely if possible. But that only works
if there are suitable indexes for costly queries.

For this table I can imagine that the queries would be only to
retrieve a set of rows based on the date column for a given project
and mailinglist. What would be the composite index that would help
with this query?

INDEX `mail_AUTOGENERATED_idx` (`mlId` ASC, `projectId` ASC,
`creationDate`ASC),

MySQL workbench has it's own naming schema for indexes, based on the
columns used.

ASC / DESC doesn't actually matter as long as the query doesn't have a
SORT BY clause. When it does, it should match.

Constant values first (mlID and projectId), followed by at most a single
range (`creationDate` since it's the only one). MySQL will use this
index as well to satisfy queries without the creationDate clause, as
long as a prefix of the index can be used.

The query cost is now reduced to a simple lookup in a b-tree plus the
cost for actually fetching the rows.

OK, that was actually oversimplified, and indexes can be used even more
efficiently with knowledge of the data, but using them at all in the
first place should already help a lot for queries known to be costly.

Use the EXPLAIN statement to see if your query is making use of any indexes.

--Andreas



At least add indexes on the text rows, or querying them will be slow
beyond reason.

Which text are being referred to? I cannot imagine that we ever query
by subject.

So the field is just stored for decoration?

No :) The intended meaning was that we won't ever use the subject in a
selection criterion. It maybe used for visualization or information
retrieval methods once retrieved from the db. I don't see how the
index will help in that regard and adding an index will just slow down
the inserts and deletions which we do on every run of the analysis.

--Mitchell


-- Andreas

Thanks,

Mitchell


+ CONSTRAINT `mail_author`
+ FOREIGN KEY (`author`)
+ REFERENCES `codeface`.`person` (`id`)
+ ON DELETE CASCADE
+ ON UPDATE CASCADE,
+ CONSTRAINT `mail_projectId`
+ FOREIGN KEY (`projectId`)
+ REFERENCES `codeface`.`project` (`id`)
+ ON DELETE CASCADE
+ ON UPDATE CASCADE,
+ CONSTRAINT `mail_mlId`
+ FOREIGN KEY (`mlId`)
+ REFERENCES `codeface`.`mailing_list` (`id`)
+ ON DELETE CASCADE
+ ON UPDATE CASCADE)
+ENGINE = InnoDB;
+
USE `codeface` ;

-- -----------------------------------------------------
--

-- Andreas





Other related posts: