Perl to please the client mind

When is information too much information? Well, if you work with data, probably never. But, clients and research report writers often need summarised information ready and waiting for KPI and ROI purposes.

In our previous post, we created this informative report of which quote contained the most words in ascending order.

ID  Movie                          WordsinQuote
15  Mary Poppins                   001
1   The Terminator                 003
9   Some randon film               003
10  Another random film            003
14  Dracula                        003
3   Predator                       004
11  Spinal Tap                     005
13  Raging Bull                    005
6   Con Air                        007
2   Terminator 2: Judgement Day    009
4   Kindergarten Cop               009
8   Kick-Ass                       010
7   Face/Off                       015
12  Snake Eyes                     018
5   Wild at Heart                  021
16  The Great Dictator             643

All well and good. They’re just movie quotes. But if the real data was survey research; for example – responses to a government consultation on a proposed plan – one of the key requirements is to determine a range of word counts the project responses generate.

The client isn’t going to trawl through the list. Neither is a report writer. That’s the developer’s/analyst’s job.

We’ll stick to movie quotes, but as always, the method can be applied to any textual data.

We’ll classify our requirements as follows; a word range for 1 to 5 words, 6 to 10 words, 11 to 20 words, 21 – 100 words, 100 plus words. We also legislate for dirty data… things that might fall through the net. You wouldn’t expect a silent movie to have an entry – it would have no quote – we need to allow for this.

So, here’s our CSV file. Note that entry for ID 17 has no quote.

ID,Movie,Year,Rating,Quotes
1,The Terminator,1984,5,"I'll be back"
2,Terminator 2: Judgement Day,1991,5,"I need your clothes, your boots and your motorcycle"
3,Predator,1987,4,"Get to the chopper!"
4,Kindergarten Cop,1990,3,"I'm a cop, you idiot! I'm Detective John Kimble!"
5,Wild at Heart,1990,5,"Did I ever tell ya that this here jacket represents a symbol of my individuality, and my belief in personal freedom?"
6,Con Air,1997,4,"Put... the bunny... back... in the box."
7,Face/Off,1997,1,"You'll be seeing a lot of changes around here. Papa's got a brand new bag."
8,Kick-Ass,2010,4,"Tool up, honey bunny. It's time to get bad guys."
9,Some randon film,2000,1,"Some random quote."
10,Another random film,2001,1,"Another random quote."
11,Spinal Tap,1984,5,"well,it's one louder, isn't it?"
12,Snake Eyes,1998,3,"I saw you and you saw me, don't pretend like you don't know who I am girly man"
13,Raging Bull,1980,5,"""I could've been a contender"""
14,Dracula,1958,4,"I am Dracula"
15,Mary Poppins,1964,4,"Supercalifragilisticexpialidocious"
16,The Great Dictator,1940,5,"I’m sorry, but I don’t want to be an emperor. That’s not my business. I don’t want to rule or conquer anyone. I should like to help everyone - if possible - Jew, Gentile - black man - white. We all want to help one another. Human beings are like that. We want to live by each other’s happiness - not by each other’s misery. We don’t want to hate and despise one another. In this world there is room for everyone. And the good earth is rich and can provide for everyone. The way of life can be free and beautiful, but we have lost the way.Greed has poisoned men’s souls, has barricaded the world with hate, has goose-stepped us into misery and bloodshed. We have developed speed, but we have shut ourselves in. Machinery that gives abundance has left us in want. Our knowledge has made us cynical. Our cleverness, hard and unkind. We think too much and feel too little. More than machinery we need humanity. More than cleverness we need kindness and gentleness. Without these qualities, life will be violent and all will be lost. The aeroplane and the radio have brought us closer together. The very nature of these inventions cries out for the goodness in men - cries out for universal brotherhood - for the unity of us all. Even now my voice is reaching millions throughout the world - millions of despairing men, women, and little children - victims of a system that makes men torture and imprison innocent people. To those who can hear me, I say - do not despair. The misery that is now upon us is but the passing of greed - the bitterness of men who fear the way of human progress. The hate of men will pass, and dictators die, and the power they took from the people will return to the people. And so long as men die, liberty will never perish. Soldiers! don’t give yourselves to brutes - men who despise you - enslave you - who regiment your lives - tell you what to do - what to think and what to feel! Who drill you - diet you - treat you like cattle, use you as cannon fodder. Don’t give yourselves to these unnatural men - machine men with machine minds and machine hearts! You are not machines! You are not cattle! You are men! You have the love of humanity in your hearts! You don’t hate! Only the unloved hate - the unloved and the unnatural! Soldiers! Don’t fight for slavery! Fight for liberty! In the 17th Chapter of St Luke it is written: “the Kingdom of God is within man” - not one man nor a group of men, but in all men! In you! You, the people have the power - the power to create machines. The power to create happiness! You, the people, have the power to make this life free and beautiful, to make this life a wonderful adventure. Then - in the name of democracy - let us use that power - let us all unite. Let us fight for a new world - a decent world that will give men a chance to work - that will give youth a future and old age a security. By the promise of these things, brutes have risen to power. But they lie! They do not fulfil that promise. They never will! Dictators free themselves but they enslave the people! Now let us fight to fulfil that promise! Let us fight to free the world - to do away with national barriers - to do away with greed, with hate and intolerance. Let us fight for a world of reason, a world where science and progress will lead to all men’s happiness. Soldiers! in the name of democracy, let us all unite!"
17,The General,1926,5,""

We use this subroutine to do the grunt work. Embedding it in the if, elsif, else conditions makes for clearer reading.

sub is_between {
my ($quotecount,$min,$max) = @_;
 return ($quotecount >= $min && $quotecount <= $max) ;
}

Let’s loop through the file and see if it works.

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(pp);
use feature 'say';

my $csv = Text::CSV->new(
  {sep_char  => ',', binary => 1,
   auto_diag =>  1, eol     => $\
  }
);

# use the subroutine against if elsif else conditions
# don't put the conditions in the sub
# easier code to read with conditions outside subroutine

sub is_between {
my ($quotecount,$min,$max) = @_;
 return ($quotecount >= $min && $quotecount <= $max) ;
}

my $file = 'Arnie_Nic.csv' ;
open (my $fh,'<:encoding(utf8)',$file) or die "Cannot open file: $!";

my @fields = @{ $csv->getline($fh) } ;

while (my $row = $csv->getline($fh)) {
  my %data ;
  @data{@fields} = @{$row} ;
  $data{Quotecount} = map { $data{Quotes} } split ( /\s+/,$data{Quotes} );
  
  if ( is_between($data{Quotecount},1,5) ) {
    say "ID $data{ID}  1-5 words";
    }
    elsif ( is_between($data{Quotecount},6,10) ) {
      say "ID $data{ID}  6-10 words";
      }
    elsif ( is_between($data{Quotecount},11,20) ) {
      say "ID $data{ID}  11-20 words";
      }
    elsif ( is_between($data{Quotecount},21,100) ) {
      say "ID $data{ID}  21-100 words";
      }
    elsif ( $data{Quotecount} > 100 ) {
      say "ID $data{ID}  Greater than 100 words";
      }
      # capture silent movies that don't have quotes	
      else {
        say "ID $data{ID}  Can't apply words to range";
      }
}
close $fh;

Which produces…

ID-1  1-5 words
ID-2  6-10 words
ID-3  1-5 words
ID-4  6-10 words
ID-5  21-100 words
ID-6  6-10 words
ID-7  11-20 words
ID-8  6-10 words
ID-9  1-5 words
ID-10  1-5 words
ID-11  1-5 words
ID-12  11-20 words
ID-13  1-5 words
ID-14  1-5 words
ID-15  1-5 words
ID-16  Greater than 100 words
ID-17  Can't apply words to range

So far so good. Bit messy with the formatting spaces. But this is a temporary solution, so that’s unimportant. We’ll adjust the main body of the code as we want to incorporate the above data into something improved.

my $file = 'Arnie_Nic.csv' ;
open (my $fh,'<:encoding(utf8)',$file) or die "Cannot open file: $!";

my @fields = @{ $csv->getline($fh) } ;

my @array;
while (my $row = $csv->getline($fh)) {
  my %data ;
  @data{@fields} = @{$row} ;
  $data{Quotecount} = map { $data{Quotes} } split ( /\s+/,$data{Quotes} );
  
  if ( is_between($data{Quotecount},1,5) ) {
    $data{Range} = "1-5 words";
    }
    elsif ( is_between($data{Quotecount},6,10) ) {
      $data{Range} = "6-10 words";
      }
    elsif ( is_between($data{Quotecount},11,20) ) {
      $data{Range} = "11-20 words";
      }
    elsif ( is_between($data{Quotecount},21,100) ) {
      $data{Range} = "21-100 words";
      }
    elsif ( $data{Quotecount} > 100 ) {
      $data{Range} = "Greater than 100 words";
      }
      # capture silent movies that don't have quotes	
      else {
        $data{Range} = "Can't apply words to range";
      }
push (@array,\%data);
}
close $fh;
say pp (\@array);

We’ve now added $data{Range} to the %data hash, which in turn, gets pushed to @array (creating an array of hashes). Here’s a snippet of the output.

[
  {
    ID         => 1,
    Movie      => "The Terminator",
    Quotecount => 3,
    Quotes     => "I'll be back",
    Range      => "1-5 words",
    Rating     => 5,
    Year       => 1984,
  },
  {
    ID         => 2,
    Movie      => "Terminator 2: Judgement Day",
    Quotecount => 9,
    Quotes     => "I need your clothes, your boots and your motorcycle",
    Range      => "6-10 words",
    Rating     => 5,
    Year       => 1991,
  },
...
  {
    ID         => 17,
    Movie      => "The General",
    Quotecount => 0,
    Quotes     => "",
    Range      => "Can't apply words to range",
    Rating     => 5,
    Year       => 1926,
  }
]

It’s important to note that all this might seem excessive, but we have the data we need to provide an audit trail, even though the eventual aim is to provide just one element of this data to the client. We have Quotes, Quotecount and Range all nicely embedded to prove our data is accurate and reliable. Remember the first paragraph… as the analyst, you can never have too much data. Even if you never present all of it, have it locked away at your disposal.

Notice how the else condition has captured ID 17 as it is a silent film – and therefore no quote. This happens with real data even when you expect a response. People submit empty text boxes in the survey data world, even when they’re supposed to have an entry.

So far, the above doesn’t clearly tell a client their word count ranges without having to read between the rest of the data. We need to provide them with a count for each occurrence of each word count range.

We add this code.

my @count = map { $_->{Range} } @array;

What has this done? Let’s print the contents of @count to check what it contains.

say join ("," , @count);

1-5 words, 6-10 words, 1-5 words, 6-10 words, 21-100 words, 6-10 words, 11-20 words, 6-10 words, 1-5 words, 1-5 words, 1-5 words, 11-20 words, 1-5 words, 1-5 words, 1-5 words, Greater than 100 words, Can't apply words to range

We have the values of $_->{Range} (from @array) in @count. We can treat these as hash keys and count their occurrence (which is the hash value) that increases if seen more than once in a loop.

my %counter;
foreach (@count) {
  $counter{$_}++;
}
say pp (\%counter);

This gives us the desired output…

{
  "1-5 words" => 8,
  "11-20 words" => 2,
  "21-100 words" => 1,
  "6-10 words" => 4,
  "Can't apply words to range" => 1,
  "Greater than 100 words" => 1,
}

Here’s the complete code

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(pp);
use feature 'say';

my $csv = Text::CSV->new(
  {sep_char  => ',', binary => 1,
   auto_diag =>  1, eol     => $\
  }
);

# don't put the conditions in the sub
# easier code to read with conditions outside subroutine

sub is_between {
my ($quotecount,$min,$max) = @_;
 return ($quotecount >= $min && $quotecount <= $max) ;
}

my $file = 'Arnie_Nic.csv' ;
open (my $fh,'<:encoding(utf8)',$file) or die "Cannot open file: $!";

my @fields = @{ $csv->getline($fh) } ;

my @array;
while (my $row = $csv->getline($fh)) {
  my %data ;
  @data{@fields} = @{$row} ;
  $data{Quotecount} = map { $data{Quotes} } split ( /\s+/,$data{Quotes} );
  
  if ( is_between($data{Quotecount},1,5) ) {
    $data{Range} = "1-5 words";
    }
    elsif ( is_between($data{Quotecount},6,10) ) {
      $data{Range} = "6-10 words";
      }
    elsif ( is_between($data{Quotecount},11,20) ) {
      $data{Range} = "11-20 words";
      }
    elsif ( is_between($data{Quotecount},21,100) ) {
      $data{Range} = "21-100 words";
      }
    elsif ( $data{Quotecount} > 100 ) {
      $data{Range} = "Greater than 100 words";
      }
      # capture silent movies that don't have quotes	
      else {
        $data{Range} = "Can't apply words to range";
      }
push (@array,\%data);
}
close $fh;
say pp (\@array);

# sum the occurrences

# create an array from the specific hash values of 'Range'
my @count = map { $_->{Range} } @array;
say join (", " , @count);


my %counter;
foreach (@count) {
  $counter{$_}++;
}
say pp (\%counter);

Until the next time.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: