[codeface] Re: [PATCH 3/3] Add global email analysis to database

  • From: Wolfgang Mauerer <wolfgang.mauerer@xxxxxxxxxxxxxxxxx>
  • To: <codeface@xxxxxxxxxxxxx>
  • Date: Thu, 5 Nov 2015 00:57:23 +0100



Am 26/10/15 um 14:15 schrieb Mitchell Joblin:

On Mon, Oct 26, 2015 at 12:43 PM, Andreas Ringlstetter
<andreas.ringlstetter@xxxxxxxxxxxxxxxxxxxx> wrote:
Good Morning,

Am 26.10.2015 um 11:54 schrieb Mitchell Joblin:
+
+-- -----------------------------------------------------
+-- Table `codeface`.`mail`
+-- -----------------------------------------------------
+DROP TABLE IF EXISTS `codeface`.`mail` ;
+
+CREATE TABLE IF NOT EXISTS `codeface`.`mail` (
+ `id` BIGINT NOT NULL AUTO_INCREMENT,
+ `projectId` BIGINT NOT NULL,
+ `emailId` VARCHAR(255) NOT NULL,

Why are you using a varchar for an ID?
Or is this supposed to be the rfc2111 "Message-ID"?
Why "NOT NULL" in the latter case? It's optional.

This is an id created during the corpus generation, its not the
Message-ID. I originally thought we needed to maintain this for some
local operations on the corpus but I think we won't have to store it.


Also questioning the length,
https://tools.ietf.org/html/rfc4130#section-5.3.3 states that the ID
should be less than 255 characters, but that's not enforced. Neither the
code nor the database scheme contains any safeguard for this case.

And if it actually is the "Message-ID", why is the corresponding
"Reply-To" field discarded? (Which should btw. be resolved prior to
storing it in the database.) Without it, the thread structure can't be
reconstructed.

Also, if it's the Message-ID, is it the mangled form without (optional)
brackets or the raw form?

+ `threadId` BIGINT NOT NULL,
+ `mlId` BIGINT NOT NULL,
+ `author` BIGINT NOT NULL,
+ `subject` VARCHAR(255) NULL DEFAULT NULL,

This will break as soon as a thread contains only a single subject line
longer than expected.
Is it even necessary to record the subject? It shouldn't differ to much
from the parent topic?
And why allowing NULL? It's a mandatory field AFAIK.

This is currently how we handle the subject line in the mail_thread
table. If we want to handle the subject in a different way then please
provide a patch for the mail_thread table as well so that we can
handle it uniformly. Maybe Wolfgang can provide the rationalization
for using VARCHAR(255) for the subject. So far we have had no issues.

the only rationale I remember is that the amount of meaningful messages
with subjects of more than 255 characters is 0 to a very good
approximation over all mailing lists I know. RFC5322 defines a maximal
line length of 998 characters, but you can fold longer lines into
multiple ones. If I read the standard correctly (just had a very brief
glimpse, though), there's thus no upper limit.

Unless we encounter real issues with 255 characters, I'd just leave
things as they are.





+ `creationDate` DATETIME NOT NULL,
+ PRIMARY KEY (`id`),
+ INDEX `mail_author_idx` (`author` ASC),
+ INDEX `mail_projectId_idx` (`projectId` ASC),
+ INDEX `mail_mlId_idx` (`mlId` ASC),

Please do me a favor and don't only add the keys required for contraints
to work, but also composite keys covering attribute combinations
commonly used in queries. This table is going to have a lot of data in
it, and you don't want to cause a full table scan, less so on possible
joins.

Sorry but I don't understand the suggestion. Do you have an example,
or a table which already does this?


At least add indexes on the text rows, or querying them will be slow
beyond reason.

Which text are being referred to? I cannot imagine that we ever query
by subject.

Thanks,

Mitchell


+ CONSTRAINT `mail_author`
+ FOREIGN KEY (`author`)
+ REFERENCES `codeface`.`person` (`id`)
+ ON DELETE CASCADE
+ ON UPDATE CASCADE,
+ CONSTRAINT `mail_projectId`
+ FOREIGN KEY (`projectId`)
+ REFERENCES `codeface`.`project` (`id`)
+ ON DELETE CASCADE
+ ON UPDATE CASCADE,
+ CONSTRAINT `mail_mlId`
+ FOREIGN KEY (`mlId`)
+ REFERENCES `codeface`.`mailing_list` (`id`)
+ ON DELETE CASCADE
+ ON UPDATE CASCADE)
+ENGINE = InnoDB;
+
USE `codeface` ;

-- -----------------------------------------------------
--

-- Andreas



Other related posts: