[Ilugc] gawk query - csv processing. moving away from MS-Access

From: raamanp@xxxxxxxxxxx (Raman.P)
Date: Tue Nov 3 15:59:05 2009

--- On Tue, 3/11/09, Siva Subramanian <elpostssv@xxxxxxxxxxxxxx> wrote:

I am into my 1st BSc Statistics and into a project for an
MNC. I am
trying my first hand at linux and was recently introduced
to gawk.

I am having trouble processing a statistical dump that is
provided to us in the
form of a csv file. The format of the file is given below

C_ID, ID_NO, stat1, vol2, amount3,...

There are around 40 fields and the csv file has close to a
million records

The C_ID, is the customer id and is only way to identify
the customer.

The ID_NO field is the premium plan the customer is in

stat1, vol2, amounts are all numbers

I can write a query that uses a few if statements and gets
the details if the
ID_NO is sequential.

1.      
However,
there are over 1000 different ID_NO (legacy) and we need to
add the stat1, vol2,
amount3 for each of the ID_NOs separately (I need to group
the ID_NO, and sum
of the fields)

2.      
If
I have the C_ID of the customers in a separate csv, is it
possible to compare
the C_ID with that of the C_ID in the dump and determine
the sum of stat1,
vol2, amount3?  (sum of the fields only
for a set of customers and not for the ID_NOs in whole)

We are extensively using
MS-Access for this and it has been a pain. A friend
suggested that I try my
hand at using tools in linux.

I am not sure if this is the
right mailing list for this.

Will really appreciate any help
in this regard.

I cannot help on gawk. However perl has nice dbi interface to csv files.

here is the standard example for reading /etc/passwd using sql

require DBI;
           my $dbh = DBI->connect("DBI:CSV:");
           $dbh->{'csv_tables'}->{'passwd'} = {
               'eol' => "\n",
               'sep_char' => ":",
               'quote_char' => undef,
               'escape_char' => undef,
               'file' => '/etc/passwd',
               'col_names' => ["login", "password", "uid", "gid", "realname",
                               "directory", "shell"]
           };
           $sth = $dbh->prepare("SELECT * FROM passwd where login like 'ram%'");
$sth->execute();
while (@array=$sth->fetchrow_array()){
        print "@array\n\n";
}

There are specialised modules for csv at cpan.org. Million records should not
be a problem at all.

Raman.P
blog:http://ramanchennai.wordpress.com/

      Keep up with people you care about with Yahoo! India Mail. Learn how.
http://in.overview.mail.yahoo.com/connectmore

Follow-Ups:
- [Ilugc] gawk query - csv processing. moving away from MS-Access
  - From: Raman.P
- [Ilugc] gawk query - csv processing. moving away from MS-Access
  - From: Shakthi Kannan

References:
- [Ilugc] gawk query - csv processing. moving away from MS-Access
  - From: Siva Subramanian

[Ilugc] gawk query - csv processing. moving away from MS-Access

Other related posts: