Music listening via Last.fm

Once upon a time, before the Raspberry Pi allowed for cheap DIY projects, I purchased a Logitech Squeezebox Radio. Its greatest feature was the Last.fm music streaming service. Then came Spotify and others, and Last.fm eventually ceased streaming.

But what it has always done (setup dependent) is log what you listen to via its scrobbling service. Effectively, every time you listen to Spotify, radio stations, music on your smartphone… you POST data (track, song title, album, artist etc.) to your Last.fm account.

My musical taste hasn’t changed much over a decade, as I’m too long in the tooth to change my listening flavour, which is predominantly the music I grew up with in the 80s and 90s. Of course, I’m not averse to discovering great new music, but most people have a staple diet.

Recently however, my listening library has been blemished by my children’s listening habits as they monopolise the devices around the house. I love them with all my heart. But we’ve got to do something about their tastes in music.

So, I’m automating a weekly script to grab the top 5 artists per week – scrobbled from various sources to my Last.fm account. Regular monitoring will tell if their listening is outweighing mine.

I make use of the Last.fm API. I’m using Perl, and although there are a few dedicated modules available to garner data from Last.fm, I’ll stick to LWP::UserAgent as the client, and place all the required parameters into the URL.
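As a minimal sketch of the “parameters in the URL” idea, the query string can also be assembled with the URI module (installed alongside LWP), which escapes awkward characters in each value. The subroutine name build_request_url, the user name and the API key are placeholders for illustration only; the script further down simply interpolates the values into a string.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: build the request URL for the weekly artist chart.
# query_form escapes each value, e.g. a space in a user name.
sub build_request_url {
    my (%param) = @_;
    my $uri = URI->new('http://ws.audioscrobbler.com/2.0/');
    $uri->query_form(
        method  => 'user.getweeklyartistchart',
        user    => $param{user},
        api_key => $param{api_key},
        format  => 'json',
    );
    return $uri->as_string;
}

# Placeholder values, not a real account or key
print build_request_url( user => 'some user', api_key => 'PLACEHOLDER_KEY' ), "\n";
```

Passing the parameters as a list keeps them in the order written, which makes the resulting URL easy to compare against the one the script builds by hand.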

Oh… and let’s visualise the results with a bar chart.

The script

#!/usr/bin/perl
##########################
# last fm weekly artists #
##########################

# top 5 weekly artists to barchart
# not using Last FM cpan module

use strict ;
use warnings ;
use LWP::UserAgent;
use JSON qw(decode_json);
use Config::Tiny;
use POSIX qw (strftime);
use GD::Graph::bars;

# %d gives a zero-padded day, so the file name never contains a space
my $date = strftime "%d-%b-%Y",localtime;

my $config_file = "$ENV{HOME}/.lastfm.cnf";
die "$config_file not there" unless -e $config_file;

my $config = Config::Tiny->read($config_file);

my $user    = $config->{lastfm}->{user};
my $api_key = $config->{lastfm}->{api_key};
                  
my $base_url    = "http://ws.audioscrobbler.com/2.0";
my $method_url  = "user.getweeklyartistchart";
my $format      = "json";

my $request_url = "$base_url/?method=$method_url&user=$user&api_key=$api_key&format=$format";

# send the request and decode json to perl data structure
# not too much held in memory - no need for content_reference
# or write to disk
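my $ua        = LWP::UserAgent->new();
my $response  = $ua->get($request_url);
# stop here if Last.fm did not answer with a success status
die "Request failed: " . $response->status_line unless $response->is_success;
my $json      = $response->decoded_content;
my $perl_data = decode_json($json);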

my (@artists, @playcounts);

# only need the top 5 weekly artists in barchart
my $x = 0;
DATUM:
foreach my $thing ( @{$perl_data->{weeklyartistchart}->{artist} } ) {
    $x++;
    push (@artists,$thing->{'name'});
    push (@playcounts,$thing->{'playcount'});
    # exit when 5 is reached
    last DATUM if $x == 5;
}

# artists and playcounts to bar chart
# create the layout
my $data = GD::Graph::Data->new( [ \@artists,\@playcounts ] );
my $graph =  GD::Graph::bars->new();

$graph->set(
    x_label           => 'ARTISTS',
    y_label           => 'PLAYCOUNT',
    x_labels_vertical => 1,
    bar_spacing       => 1,
    title             => "Last.fm data $date",
) or die $graph->error;

$graph->plot($data) or die $graph->error;

# barchart to image file
my $file = "WeeklyArtists_$date.png";
open (my $picture,'>',$file) or die "Cannot open file $file $!";
binmode $picture;
print $picture $graph->gd->png;
close $picture;
exit;

Which gives me a little image file that displays…

All looks ok. The last artist I know little about. But, if you’ve ever seen the movie Drive (2011), then this is a standout track https://www.youtube.com/watch?v=-DSVDcw6iW8.

It’s quite easy to build on this, and obtain different data from Last.fm. If you don’t yet scrobble to Last.fm and you’d like to, simply head over to https://www.last.fm/ and create an account. Then, on whatever listening platform you use (Spotify, Deezer etc.), enable scrobbling from the settings.

To my knowledge, Amazon Music does not scrobble to Last.fm – so this might prove a forlorn exercise.

Until the next time.

Don’t treat your date like just another number

There’s more than one type of date that can prove awkward, but this is Perl, and so I’m talking calendar dates. And when I say awkward, I mean the interpretation of how a date should be treated.

It’s very tempting when we see something like this, to think we can always get away with comparing dates as we see them.

10/12/1994
11/12/1994
01/01/1995
02/01/1995
01/01/1996

Surely, it’s a simple case of comparing by means of greater than / less than? That is to say, if we can do this…

#!/usr/bin/perl
use strict;
use warnings;

# if 4 is greater than 2 - which it is
if ( 4 > 2 ) {
  print "Four is greater than two" , "\n";
  }
    else {
      print "Four is not greater than two";
      }

Which, as you might expect, outputs…

Four is greater than two

… we can do something similar with a date?

# if 1995 comes before 1996 - which it does
if ( 01/01/1996 > 10/01/1995) {
  print "1996 is a later date than 1995 " , "\n";
  }
    else {
      print "Something not quite right" , "\n";
      }

Which produces…

Something not quite right

The short answer is, we can’t treat these as dates… yet. We’re just treating them as strings or numbers, depending on which operator does the comparing. And here, with a numeric operator and no quotes, Perl reads each / as a division: 01/01/1996 is 1 divided by 1 divided by 1996 (roughly 0.0005) and 10/01/1995 is 10 divided by 1 divided by 1995 (roughly 0.005), so the 1996 “date” ends up as the smaller number.
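To see what Perl actually compares, here is a small sketch (the variable names are mine, purely for illustration). It evaluates the two unquoted “dates” as the divisions they really are, then compares the quoted versions as plain strings. Neither route knows anything about calendars.

```perl
use strict;
use warnings;

# Unquoted, each "date" is two divisions: 1 / 1 / 1996 and 10 / 1 / 1995
my $as_number_1996 = 01/01/1996;
my $as_number_1995 = 10/01/1995;

printf "%.6f versus %.6f\n", $as_number_1996, $as_number_1995;
print "As numbers, the 1996 value is the smaller one\n"
    if $as_number_1996 < $as_number_1995;

# Quoted, they are strings and are compared character by character,
# so the leading '0' of '01/01/1996' sorts before the leading '1'
print "As strings, the 1996 value sorts first\n"
    if '01/01/1996' lt '10/01/1995';
```

Either way the later date loses the comparison, which is why the else branch fired.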

So, here’s something more robust.

#!/usr/bin/perl
use strict;
use warnings;
use DateTime::Format::Strptime;
use feature 'say';

# create a date format - British style

my $format = DateTime::Format::Strptime->new( pattern => '%d/%m/%Y' );

# create a date to use to compare against what's in __DATA__

my $compare_date = DateTime->new(year   => 1994, 
                                 month  => 12, 
                                 day    => 31, 
                                 formatter => $format
                                 );
            
while (<DATA>) {
	chomp;
	my ($id,$name,$date_text) = split (/,/,$_);
	
	# set the text in data to Date objects
    my $date = $format->parse_datetime($date_text);
    
    # set the formatting style
    
    $date->set_formatter($format);
    $format = $date->formatter();
    
    # if 31/12/1994 is a later date than the string-to-date object
     
    if ( $compare_date > $date) {
		say "$id date $date comes before " . $compare_date;
	}
	
	# if 31/12/1994 is an earlier date than the string-to-date object
	  else {
		say "$id date $date comes after " . $compare_date;
	  }
}


close DATA;

__DATA__
ID1,bar,10/12/1994
ID2,baz,11/12/1994
ID3,foobar,01/01/1995
ID4,foobaz,02/01/1995
ID5,foofoo,01/01/1996

We create a formatting style for dates – day of month, month and year – using DateTime::Format::Strptime and assign it to $format.

We set a date object $compare_date with DateTime. This is used to compare the dates parsed in __DATA__.

If we don’t set the format for $date (taken from $date_text in __DATA__) with $date->set_formatter($format), then $date will be output using ISO 8601 format, for example 1994-12-10T00:00:00.

The output is…

ID1 date 10/12/1994 comes before 31/12/1994
ID2 date 11/12/1994 comes before 31/12/1994
ID3 date 01/01/1995 comes after 31/12/1994
ID4 date 02/01/1995 comes after 31/12/1994
ID5 date 01/01/1996 comes after 31/12/1994
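If DateTime isn’t installed, the same comparison can be sketched with Time::Piece, which ships with Perl. This is a swapped-in alternative, not the method used above, and parse_uk_date is a name made up for the example.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: turn a dd/mm/yyyy string into a Time::Piece object
sub parse_uk_date {
    my ($text) = @_;
    return Time::Piece->strptime( $text, '%d/%m/%Y' );
}

my $compare_date = parse_uk_date('31/12/1994');

for my $text ( '10/12/1994', '01/01/1995' ) {
    my $date = parse_uk_date($text);

    # Time::Piece objects can be compared with the usual numeric operators
    my $relation = $date < $compare_date ? 'comes before' : 'comes after';

    # dmy('/') prints the date back in British style
    print $date->dmy('/'), " $relation ", $compare_date->dmy('/'), "\n";
}
```

The principle is identical: parse the text into a date object first, and only then compare.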

Until the next time.

Right Place Right Name::Space – Stick to the Road

In order for a Perl script to find the subroutine or subroutines it requires in a home-grown module, there are a few caveats to take into account. There are a number of modules that can replace the need to write ‘use lib’, and that’s fine. You could also export the PERL5LIB environment variable from your ~/.bashrc file (if on Unix / Linux), listing the directories your home-grown modules reside in, so that they are added to @INC. @INC is the list of directories Perl searches when a script imports a module. It already includes the directories where the modules shipped with a specific version of Perl are kept, and those where modules installed via CPAN end up.

But for good old understanding of name spaces, the ‘use lib’ pragma is an option to write explicitly. And name spaces are simply a matter of knowing where your module and script reside in the directory structure. The syntax provided ensures they can communicate. Understand absolute directory pathways… understand name spaces.

‘tree’ is your friend

The ‘tree’ command can assist with a visual understanding. Here’s the output of a main directory and its sub directories.

DataCage->tree
.
├── lib
│   └── My
│       └── ModuleDir
│           └── Shortwave.pm
├── Production
│   └── Megahertz.pl
└── t

5 directories, 2 files

The 5 directories are… the lib directory with a further My directory, which in turn has a ModuleDir directory. Parallel to the lib directory are a Production directory and a t directory. The script ‘Megahertz.pl’ sits in the Production directory. The module ‘Shortwave.pm’ sits in the My/ModuleDir directory. The ‘t’ directory is empty.

The Shortwave.pm module contains a simple subroutine that the Megahertz.pl script needs. The user input of Megahertz.pl serves as the argument to the converter subroutine. These are the 2 files.

Here’s the Shortwave.pm module.

package My::ModuleDir::Shortwave;
use strict;
use warnings;
use Scalar::Util 'looks_like_number';

our $VERSION = 0.01;

use base 'Exporter';
our @EXPORT_OK = qw(converter looks_like_number);
our %EXPORT_TAGS = ( all => \@EXPORT_OK);

# convert a frequency in Megahertz to a wavelength in Metres
sub converter {
  my $DIVIDER = 299.792458;
  my $number = shift;
  my $new_number = ($DIVIDER / $number);
  return $new_number;
}

# end with a true value so that 'use' succeeds
1;

Note that package My::ModuleDir reflects the pathway that the tree structure displays, and ‘Shortwave’ represents the name of the module – Shortwave.pm. We haven’t given the full path from the main directory. In other words, we haven’t written package lib::My::ModuleDir::Shortwave.

The reason for that is the use lib ‘lib’ entry in Megahertz.pl, which effectively gives the leading directory name (lib) under which My/ModuleDir/Shortwave.pm is found. If the lib directory were called ‘BigDir’, then the entry in Megahertz.pl would be use lib ‘BigDir’.
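The mapping from package name to file can be sketched in a few lines. This only illustrates the naming rule (each :: becomes a directory separator and .pm is appended); module_path and candidate_file are names invented for the example, not part of Perl.

```perl
use strict;
use warnings;

# Turn a package name into the relative file path that 'use' looks for
sub module_path {
    my ($package) = @_;
    ( my $path = $package ) =~ s{::}{/}g;
    return "$path.pm";
}

# Join a 'use lib' directory with that relative path
sub candidate_file {
    my ( $lib_dir, $package ) = @_;
    return join '/', $lib_dir, module_path($package);
}

print module_path('My::ModuleDir::Shortwave'), "\n";             # My/ModuleDir/Shortwave.pm
print candidate_file( 'lib', 'My::ModuleDir::Shortwave' ), "\n"; # lib/My/ModuleDir/Shortwave.pm
```

Perl tries each directory in @INC in turn in much the same way, which is why the directory named in use lib must be the one that contains My, not the one that contains Shortwave.pm.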

Here’s the Megahertz.pl script

#!/usr/bin/perl
###############################
# convert Megahertz to Metres #
###############################

use strict;
use warnings;
use lib 'lib';
use My::ModuleDir::Shortwave ':all';
use feature 'say';

# get user input

print 'Enter Megahertz: ';
my $frequency = <STDIN>;
chomp $frequency;

# if the input is a true value and looks like a number

if ( ($frequency ) && (looks_like_number($frequency)) ) {
  say converter($frequency) ." Metres";
   }
   else {
      say "Need a number - a real number";
    }
exit;

As well as use lib, we tell Megahertz.pl what module to import and which subroutines from that module to use. This is represented by use My::ModuleDir::Shortwave ‘:all’; with ‘:all’ standing for all the subroutines Shortwave.pm offers for export.

There are, in fact, two subroutines to use from Shortwave.pm. That’s because, in addition to the home-made converter subroutine, the Scalar::Util module was declared in the Shortwave module rather than in the Megahertz script, and its ‘looks_like_number’ subroutine is listed in @EXPORT_OK. Because it was declared and exported in the module, it in turn is imported into the script by means of use My::ModuleDir::Shortwave ‘:all’.

An intentional error to highlight Megahertz.pl cannot find Shortwave.pm

We navigate to the directory where Megahertz.pl sits. We run it.

DataCage->cd Production/
DataCage->ls
Megahertz.pl
DataCage->perl Megahertz.pl

But things go awry…

Can't locate My/ModuleDir/Shortwave.pm in @INC (you may need to install the My::ModuleDir::Shortwave module) (@INC contains: lib /etc/perl /usr/local/lib/perl/5.18.2 /usr/local/share/perl/5.18.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.18 /usr/share/perl/5.18 /usr/local/lib/site_perl .) at Megahertz.pl line 9.
BEGIN failed--compilation aborted at Megahertz.pl line 9.

What went wrong? Well, there was no need to navigate to the directory of Megahertz.pl. In doing so, we’ve messed up the intended meaning of the use lib pragma: the relative path ‘lib’ is resolved against the directory we run the script from, not the directory the script lives in. If we are in the Production folder (where Megahertz.pl resides), we have to go up a level for the lib directory to be visible to Megahertz.pl, so we’d have to change the syntax to use lib ‘../lib’ in order for things to work.

#!/usr/bin/perl
use strict;
use warnings;
use lib '../lib';
use My::ModuleDir::Shortwave ':all';
use feature 'say';

Probably not a good idea.

“Stay off the moors. Stick to the road.”

Some sound advice to avoid being attacked by a werewolf. But it also means that Megahertz.pl will run successfully if we don’t navigate off the main directory path, and we can see the lib, Production and t directories on the same level.

.⇐ YOU ARE HERE
├── lib
│   └── My
│       └── ModuleDir
│           └── Shortwave.pm
├── Production
│   └── Megahertz.pl
└── t

We keep it as planned…

use lib 'lib';

And we execute from the main directory level by providing the path to the script from there – Production/Megahertz.pl.

DataCage->ls
lib  Production  t
DataCage->perl Production/Megahertz.pl

Which results in Megahertz.pl successfully importing lib/My/ModuleDir/Shortwave.pm – aka – package My::ModuleDir::Shortwave.

Enter Megahertz: 9.68
30.9702952479339 Metres
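If running from some other directory ever becomes unavoidable, one option (not used in this post) is the core FindBin module, which tells the script where it lives so the lib path can be anchored to that rather than to wherever you happen to be standing. A minimal sketch, assuming the same layout with lib sitting next to Production:

```perl
use strict;
use warnings;
use FindBin qw($Bin);     # $Bin is the directory holding the running script
use lib "$Bin/../lib";    # one level up from the script, then into lib

print "Script directory: $Bin\n";
print "Added to \@INC: $Bin/../lib\n"
    if grep { $_ eq "$Bin/../lib" } @INC;
```

With those two lines in place of use lib ‘lib’, the search path no longer depends on the current working directory.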

Until the next time.

Perl to please the client mind

When is information too much information? Well, if you work with data, probably never. But, clients and research report writers often need summarised information ready and waiting for KPI and ROI purposes.

In our previous post, we created this informative report of which quote contained the most words in ascending order.

ID  Movie                          WordsinQuote
15  Mary Poppins                   001
1   The Terminator                 003
9   Some randon film               003
10  Another random film            003
14  Dracula                        003
3   Predator                       004
11  Spinal Tap                     005
13  Raging Bull                    005
6   Con Air                        007
2   Terminator 2: Judgement Day    009
4   Kindergarten Cop               009
8   Kick-Ass                       010
7   Face/Off                       015
12  Snake Eyes                     018
5   Wild at Heart                  021
16  The Great Dictator             643

All well and good. They’re just movie quotes. But if the real data were survey research (for example, responses to a government consultation on a proposed plan), one of the key requirements would be to determine the range of word counts the responses generate.

The client isn’t going to trawl through the list. Neither is a report writer. That’s the developer’s/analyst’s job.

We’ll stick to movie quotes, but as always, the method can be applied to any textual data.

We’ll classify our requirements as follows: a word range for 1 to 5 words, 6 to 10 words, 11 to 20 words, 21 to 100 words, and more than 100 words. We also legislate for dirty data… things that might fall through the net. You wouldn’t expect a silent movie to have an entry, as it would have no quote, so we need to allow for this.

So, here’s our CSV file. Note that the entry for ID 17 has no quote.

ID,Movie,Year,Rating,Quotes
1,The Terminator,1984,5,"I'll be back"
2,Terminator 2: Judgement Day,1991,5,"I need your clothes, your boots and your motorcycle"
3,Predator,1987,4,"Get to the chopper!"
4,Kindergarten Cop,1990,3,"I'm a cop, you idiot! I'm Detective John Kimble!"
5,Wild at Heart,1990,5,"Did I ever tell ya that this here jacket represents a symbol of my individuality, and my belief in personal freedom?"
6,Con Air,1997,4,"Put... the bunny... back... in the box."
7,Face/Off,1997,1,"You'll be seeing a lot of changes around here. Papa's got a brand new bag."
8,Kick-Ass,2010,4,"Tool up, honey bunny. It's time to get bad guys."
9,Some randon film,2000,1,"Some random quote."
10,Another random film,2001,1,"Another random quote."
11,Spinal Tap,1984,5,"well,it's one louder, isn't it?"
12,Snake Eyes,1998,3,"I saw you and you saw me, don't pretend like you don't know who I am girly man"
13,Raging Bull,1980,5,"""I could've been a contender"""
14,Dracula,1958,4,"I am Dracula"
15,Mary Poppins,1964,4,"Supercalifragilisticexpialidocious"
16,The Great Dictator,1940,5,"I’m sorry, but I don’t want to be an emperor. That’s not my business. I don’t want to rule or conquer anyone. I should like to help everyone - if possible - Jew, Gentile - black man - white. We all want to help one another. Human beings are like that. We want to live by each other’s happiness - not by each other’s misery. We don’t want to hate and despise one another. In this world there is room for everyone. And the good earth is rich and can provide for everyone. The way of life can be free and beautiful, but we have lost the way.Greed has poisoned men’s souls, has barricaded the world with hate, has goose-stepped us into misery and bloodshed. We have developed speed, but we have shut ourselves in. Machinery that gives abundance has left us in want. Our knowledge has made us cynical. Our cleverness, hard and unkind. We think too much and feel too little. More than machinery we need humanity. More than cleverness we need kindness and gentleness. Without these qualities, life will be violent and all will be lost. The aeroplane and the radio have brought us closer together. The very nature of these inventions cries out for the goodness in men - cries out for universal brotherhood - for the unity of us all. Even now my voice is reaching millions throughout the world - millions of despairing men, women, and little children - victims of a system that makes men torture and imprison innocent people. To those who can hear me, I say - do not despair. The misery that is now upon us is but the passing of greed - the bitterness of men who fear the way of human progress. The hate of men will pass, and dictators die, and the power they took from the people will return to the people. And so long as men die, liberty will never perish. Soldiers! don’t give yourselves to brutes - men who despise you - enslave you - who regiment your lives - tell you what to do - what to think and what to feel! 
Who drill you - diet you - treat you like cattle, use you as cannon fodder. Don’t give yourselves to these unnatural men - machine men with machine minds and machine hearts! You are not machines! You are not cattle! You are men! You have the love of humanity in your hearts! You don’t hate! Only the unloved hate - the unloved and the unnatural! Soldiers! Don’t fight for slavery! Fight for liberty! In the 17th Chapter of St Luke it is written: “the Kingdom of God is within man” - not one man nor a group of men, but in all men! In you! You, the people have the power - the power to create machines. The power to create happiness! You, the people, have the power to make this life free and beautiful, to make this life a wonderful adventure. Then - in the name of democracy - let us use that power - let us all unite. Let us fight for a new world - a decent world that will give men a chance to work - that will give youth a future and old age a security. By the promise of these things, brutes have risen to power. But they lie! They do not fulfil that promise. They never will! Dictators free themselves but they enslave the people! Now let us fight to fulfil that promise! Let us fight to free the world - to do away with national barriers - to do away with greed, with hate and intolerance. Let us fight for a world of reason, a world where science and progress will lead to all men’s happiness. Soldiers! in the name of democracy, let us all unite!"
17,The General,1926,5,""

We use this subroutine to do the grunt work. Embedding it in the if, elsif, else conditions makes for clearer reading.

sub is_between {
my ($quotecount,$min,$max) = @_;
 return ($quotecount >= $min && $quotecount <= $max) ;
}

Let’s loop through the file and see if it works.

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(pp);
use feature 'say';

my $csv = Text::CSV->new(
  {sep_char  => ',', binary => 1,
   auto_diag =>  1, eol     => $\
  }
);

# use the subroutine against if elsif else conditions
# don't put the conditions in the sub
# easier code to read with conditions outside subroutine

sub is_between {
my ($quotecount,$min,$max) = @_;
 return ($quotecount >= $min && $quotecount <= $max) ;
}

my $file = 'Arnie_Nic.csv' ;
open (my $fh,'<:encoding(utf8)',$file) or die "Cannot open file: $!";

my @fields = @{ $csv->getline($fh) } ;

while (my $row = $csv->getline($fh)) {
  my %data ;
  @data{@fields} = @{$row} ;
  $data{Quotecount} = map { $data{Quotes} } split ( /\s+/,$data{Quotes} );
  
  if ( is_between($data{Quotecount},1,5) ) {
    say "ID $data{ID}  1-5 words";
    }
    elsif ( is_between($data{Quotecount},6,10) ) {
      say "ID $data{ID}  6-10 words";
      }
    elsif ( is_between($data{Quotecount},11,20) ) {
      say "ID $data{ID}  11-20 words";
      }
    elsif ( is_between($data{Quotecount},21,100) ) {
      say "ID $data{ID}  21-100 words";
      }
    elsif ( $data{Quotecount} > 100 ) {
      say "ID $data{ID}  Greater than 100 words";
      }
      # capture silent movies that don't have quotes	
      else {
        say "ID $data{ID}  Can't apply words to range";
      }
}
close $fh;

Which produces…

ID 1  1-5 words
ID 2  6-10 words
ID 3  1-5 words
ID 4  6-10 words
ID 5  21-100 words
ID 6  6-10 words
ID 7  11-20 words
ID 8  6-10 words
ID 9  1-5 words
ID 10  1-5 words
ID 11  1-5 words
ID 12  11-20 words
ID 13  1-5 words
ID 14  1-5 words
ID 15  1-5 words
ID 16  Greater than 100 words
ID 17  Can't apply words to range

So far so good. Bit messy with the formatting spaces. But this is a temporary solution, so that’s unimportant. We’ll adjust the main body of the code as we want to incorporate the above data into something improved.

my $file = 'Arnie_Nic.csv' ;
open (my $fh,'<:encoding(utf8)',$file) or die "Cannot open file: $!";

my @fields = @{ $csv->getline($fh) } ;

my @array;
while (my $row = $csv->getline($fh)) {
  my %data ;
  @data{@fields} = @{$row} ;
  $data{Quotecount} = map { $data{Quotes} } split ( /\s+/,$data{Quotes} );
  
  if ( is_between($data{Quotecount},1,5) ) {
    $data{Range} = "1-5 words";
    }
    elsif ( is_between($data{Quotecount},6,10) ) {
      $data{Range} = "6-10 words";
      }
    elsif ( is_between($data{Quotecount},11,20) ) {
      $data{Range} = "11-20 words";
      }
    elsif ( is_between($data{Quotecount},21,100) ) {
      $data{Range} = "21-100 words";
      }
    elsif ( $data{Quotecount} > 100 ) {
      $data{Range} = "Greater than 100 words";
      }
      # capture silent movies that don't have quotes	
      else {
        $data{Range} = "Can't apply words to range";
      }
push (@array,\%data);
}
close $fh;
say pp (\@array);

We’ve now added $data{Range} to the %data hash, which in turn, gets pushed to @array (creating an array of hashes). Here’s a snippet of the output.

[
  {
    ID         => 1,
    Movie      => "The Terminator",
    Quotecount => 3,
    Quotes     => "I'll be back",
    Range      => "1-5 words",
    Rating     => 5,
    Year       => 1984,
  },
  {
    ID         => 2,
    Movie      => "Terminator 2: Judgement Day",
    Quotecount => 9,
    Quotes     => "I need your clothes, your boots and your motorcycle",
    Range      => "6-10 words",
    Rating     => 5,
    Year       => 1991,
  },
...
  {
    ID         => 17,
    Movie      => "The General",
    Quotecount => 0,
    Quotes     => "",
    Range      => "Can't apply words to range",
    Rating     => 5,
    Year       => 1926,
  }
]

It’s important to note that all this might seem excessive, but we have the data we need to provide an audit trail, even though the eventual aim is to provide just one element of this data to the client. We have Quotes, Quotecount and Range all nicely embedded to prove our data is accurate and reliable. Remember the first paragraph… as the analyst, you can never have too much data. Even if you never present all of it, have it locked away at your disposal.

Notice how the else condition has captured ID 17 as it is a silent film – and therefore no quote. This happens with real data even when you expect a response. People submit empty text boxes in the survey data world, even when they’re supposed to have an entry.

So far, the above doesn’t clearly tell a client their word count ranges without having to read between the rest of the data. We need to provide them with a count for each occurrence of each word count range.

We add this code.

my @count = map { $_->{Range} } @array;

What has this done? Let’s print the contents of @count to check what it contains.

say join (", " , @count);

1-5 words, 6-10 words, 1-5 words, 6-10 words, 21-100 words, 6-10 words, 11-20 words, 6-10 words, 1-5 words, 1-5 words, 1-5 words, 11-20 words, 1-5 words, 1-5 words, 1-5 words, Greater than 100 words, Can't apply words to range

We have the values of $_->{Range} (from @array) in @count. We can treat these as hash keys and count their occurrence (which is the hash value) that increases if seen more than once in a loop.

my %counter;
foreach (@count) {
  $counter{$_}++;
}
say pp (\%counter);

This gives us the desired output…

{
  "1-5 words" => 8,
  "11-20 words" => 2,
  "21-100 words" => 1,
  "6-10 words" => 4,
  "Can't apply words to range" => 1,
  "Greater than 100 words" => 1,
}
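A hash dump is fine for us, but hash keys come back in no particular order, so a client-facing summary probably wants the ranges in reading order. Here is a minimal sketch of one way to print it, assuming the counts shown in the dump above; @RANGE_ORDER and summary_lines are names invented for the example.

```perl
use strict;
use warnings;

# Range labels in the order a reader would expect them
my @RANGE_ORDER = (
    '1-5 words',
    '6-10 words',
    '11-20 words',
    '21-100 words',
    'Greater than 100 words',
    "Can't apply words to range",
);

# Build one report line per range; a range never seen is reported as 0
sub summary_lines {
    my ($counter) = @_;
    return map { sprintf '%-28s %3d', $_, $counter->{$_} // 0 } @RANGE_ORDER;
}

# Counts copied from the %counter dump above
my %counter = (
    '1-5 words'                  => 8,
    '6-10 words'                 => 4,
    '11-20 words'                => 2,
    '21-100 words'               => 1,
    'Greater than 100 words'     => 1,
    "Can't apply words to range" => 1,
);

print "$_\n" for summary_lines( \%counter );
```

Keeping the label order in one array means a new range only has to be added in one place for the report.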

Here’s the complete code

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(pp);
use feature 'say';

my $csv = Text::CSV->new(
  {sep_char  => ',', binary => 1,
   auto_diag =>  1, eol     => $\
  }
);

# don't put the conditions in the sub
# easier code to read with conditions outside subroutine

sub is_between {
my ($quotecount,$min,$max) = @_;
 return ($quotecount >= $min && $quotecount <= $max) ;
}

my $file = 'Arnie_Nic.csv' ;
open (my $fh,'<:encoding(utf8)',$file) or die "Cannot open file: $!";

my @fields = @{ $csv->getline($fh) } ;

my @array;
while (my $row = $csv->getline($fh)) {
  my %data ;
  @data{@fields} = @{$row} ;
  $data{Quotecount} = map { $data{Quotes} } split ( /\s+/,$data{Quotes} );
  
  if ( is_between($data{Quotecount},1,5) ) {
    $data{Range} = "1-5 words";
    }
    elsif ( is_between($data{Quotecount},6,10) ) {
      $data{Range} = "6-10 words";
      }
    elsif ( is_between($data{Quotecount},11,20) ) {
      $data{Range} = "11-20 words";
      }
    elsif ( is_between($data{Quotecount},21,100) ) {
      $data{Range} = "21-100 words";
      }
    elsif ( $data{Quotecount} > 100 ) {
      $data{Range} = "Greater than 100 words";
      }
      # capture silent movies that don't have quotes	
      else {
        $data{Range} = "Can't apply words to range";
      }
push (@array,\%data);
}
close $fh;
say pp (\@array);

# sum the occurrences

# create an array from the specific hash values of 'Range'
my @count = map { $_->{Range} } @array;
say join (", " , @count);


my %counter;
foreach (@count) {
  $counter{$_}++;
}
say pp (\%counter);

Until the next time.

Perl Counting Spoken Words of a Silent Genius

In previous posts (CSV data parts 1, 2, 3 and 4) we had some movie quotes. The naive way to decide who has the longest quote could be to find the length of the quote (i.e. the number of characters in the string that represents the quote).

But, that doesn’t always hold true when the longest quote is defined as which quote has the most words – and by words we mean any alpha and/or numeric characters including any accompanying commas, full-stops, exclamation marks etc. separated by a space or spaces.

As an example, here’s Arnie…

$ string="I'll be back"
$ echo ${#string}
12

And here’s Julie and Dick…

$ string="Supercalifragilisticexpialidocious"
$ echo ${#string}
34

So, in terms of words spoken, we can see Arnie has 3 words. Julie and Dick have 1 word. There isn’t a single space from start to finish of “Supercalifragilisticexpialidocious”. Therefore, if we define a quote as being longer than another quote by the amount of individual words, using the length function is the wrong option. We only needed to use the (Linux) command line to determine this.
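The same comparison can be sketched in Perl itself. count_words is a name made up for this example; it splits on runs of whitespace, which matches the definition of a word given above.

```perl
use strict;
use warnings;

# Count the whitespace-separated words in a quote
sub count_words {
    my ($quote) = @_;
    # split ' ' breaks on runs of whitespace and ignores any leading space
    my @words = split ' ', $quote;
    return scalar @words;
}

for my $quote ( "I'll be back", 'Supercalifragilisticexpialidocious' ) {
    printf "%-36s characters: %2d  words: %d\n",
        $quote, length($quote), count_words($quote);
}
```

Arnie wins on words (3 against 1) even though Julie and Dick win on characters (34 against 12).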

Here’s a CSV file, and we’ve added one of the most powerful speeches from a movie to address humankind.

How would we go about creating a word count for each quote?

ID,Movie,Year,Rating,Quotes
1,The Terminator,1984,5,"I'll be back"
2,Terminator 2: Judgement Day,1991,5,"I need your clothes, your boots and your motorcycle"
3,Predator,1987,4,"Get to the chopper!"
4,Kindergarten Cop,1990,3,"I'm a cop, you idiot! I'm Detective John Kimble!"
5,Wild at Heart,1990,5,"Did I ever tell ya that this here jacket represents a symbol of my individuality, and my belief in personal freedom?"
6,Con Air,1997,4,"Put... the bunny... back... in the box."
7,Face/Off,1997,1,"You'll be seeing a lot of changes around here. Papa's got a brand new bag."
8,Kick-Ass,2010,4,"Tool up, honey bunny. It's time to get bad guys."
9,Some randon film,2000,1,"Some random quote."
10,Another random film,2001,1,"Another random quote."
11,Spinal Tap,1984,5,"well,it's one louder, isn't it?"
12,Snake Eyes,1998,3,"I saw you and you saw me, don't pretend like you don't know who I am girly man"
13,Raging Bull,1980,5,"""I could've been a contender"""
14,Dracula,1958,4,"I am Dracula"
15,Mary Poppins,1964,4,"Supercalifragilisticexpialidocious"
16,The Great Dictator,1940,5,"I’m sorry, but I don’t want to be an emperor. That’s not my business. I don’t want to rule or conquer anyone. I should like to help everyone - if possible - Jew, Gentile - black man - white. We all want to help one another. Human beings are like that. We want to live by each other’s happiness - not by each other’s misery. We don’t want to hate and despise one another. In this world there is room for everyone. And the good earth is rich and can provide for everyone. The way of life can be free and beautiful, but we have lost the way.Greed has poisoned men’s souls, has barricaded the world with hate, has goose-stepped us into misery and bloodshed. We have developed speed, but we have shut ourselves in. Machinery that gives abundance has left us in want. Our knowledge has made us cynical. Our cleverness, hard and unkind. We think too much and feel too little. More than machinery we need humanity. More than cleverness we need kindness and gentleness. Without these qualities, life will be violent and all will be lost. The aeroplane and the radio have brought us closer together. The very nature of these inventions cries out for the goodness in men - cries out for universal brotherhood - for the unity of us all. Even now my voice is reaching millions throughout the world - millions of despairing men, women, and little children - victims of a system that makes men torture and imprison innocent people. To those who can hear me, I say - do not despair. The misery that is now upon us is but the passing of greed - the bitterness of men who fear the way of human progress. The hate of men will pass, and dictators die, and the power they took from the people will return to the people. And so long as men die, liberty will never perish. Soldiers! don’t give yourselves to brutes - men who despise you - enslave you - who regiment your lives - tell you what to do - what to think and what to feel! 
Who drill you - diet you - treat you like cattle, use you as cannon fodder. Don’t give yourselves to these unnatural men - machine men with machine minds and machine hearts! You are not machines! You are not cattle! You are men! You have the love of humanity in your hearts! You don’t hate! Only the unloved hate - the unloved and the unnatural! Soldiers! Don’t fight for slavery! Fight for liberty! In the 17th Chapter of St Luke it is written: “the Kingdom of God is within man” - not one man nor a group of men, but in all men! In you! You, the people have the power - the power to create machines. The power to create happiness! You, the people, have the power to make this life free and beautiful, to make this life a wonderful adventure. Then - in the name of democracy - let us use that power - let us all unite. Let us fight for a new world - a decent world that will give men a chance to work - that will give youth a future and old age a security. By the promise of these things, brutes have risen to power. But they lie! They do not fulfil that promise. They never will! Dictators free themselves but they enslave the people! Now let us fight to fulfil that promise! Let us fight to free the world - to do away with national barriers - to do away with greed, with hate and intolerance. Let us fight for a world of reason, a world where science and progress will lead to all men’s happiness. Soldiers! in the name of democracy, let us all unite!"

Perl can accomplish this quite easily.

#!/usr/bin/perl

################################################################
# create an extra hash key value - word count from quote       #
# sort the order of words in quotes lowest to highest          #
# or ID if words in quotes are the same from different records #
################################################################
use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(pp);

my $csv = Text::CSV->new(
  {
   sep_char  => ',',
   binary    =>  1,
   auto_diag =>  1,
   eol       => $\
  }
);

my $file = 'Arnie_Nic.csv' ;
open (my $fh,'<:encoding(utf8)',$file) or die "Cannot open file: $!";

my @fields = @{ $csv->getline($fh) } ;

# create the holding array that the hash is pushed onto

my @array ;
while (my $row = $csv->getline($fh)) {
	
	# create the hash
	my %data ;
	# create the hash slice
	# de-reference the array ref '$row'
	@data{@fields} = @{$row} ;
	
	# add another entry to the hash
	# key is Quotecount value is number of words in Quotes
	
	$data{Quotecount} = map { $data{Quotes} } split ( /\s+/,$data{Quotes} );
	
	# push the hash onto the holding array
	# to create an array of hashes ;
	
	push (@array,\%data);
}
close $fh;

# sort by amount of words in Quote
# if words from different quotes are equal
# sort by ID

sub Quote_sorter {
	$a->{Quotecount} <=> $b->{Quotecount} ||
	$a->{ID}         <=> $b->{ID}
} 

my @sorted = sort Quote_sorter (@array);

printf "%-3s %-30s %-12s\n" , 'ID', 'Movie', 'WordsinQuote';
map { printf "%-3d %-30s %03d\n" , $_->{ID}, $_->{Movie}, $_->{Quotecount} }  @sorted;

Line 41 is doing all the heavy work. We’ve created an array of hashes from the file, and done it in such a way that we haven’t called column_names and getline_hr from the Text::CSV module.
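
The counting on line 41 relies on map returning the number of elements it generated when it is evaluated in scalar context. A minimal sketch of that idea, using an invented quote held in a variable rather than a row from the CSV file (the variable names here are illustrative only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical quote, not read from Arnie_Nic.csv
my $quote = "I'll be back";

# split returns the list of words; map in scalar context
# returns how many elements it produced from that list
my $count = map { 1 } split( /\s+/, $quote );

# the same count, reached by sizing an array explicitly
my @words  = split( /\s+/, $quote );
my $count2 = scalar @words;

print "$count $count2\n";    # prints "3 3"
```

The array version is longer, but some readers may find it easier to follow than map in scalar context.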

We needn’t have created this structure. We could have simply looped through the file and printed an added word count of quotes with the surrounding data from the CSV file.

But we’ve added some features on lines 54 – 62 to produce a nice report. Lines 54 – 57 are a subroutine to sort the array of hashes data structure. On line 59, we assign a new array (of hashes) which holds the sorted order of the data. See the comments on lines 50 – 52 for the reasoning behind this sorting.
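
The tie-break in the sort can be tried in isolation. This is a sketch only, with made-up records standing in for the rows of the CSV file; it uses an inline sort block instead of a named subroutine, but the comparison is the same.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical records standing in for the rows read from the CSV
my @records = (
    { ID => 9, Movie => 'Film C', Quotecount => 3 },
    { ID => 3, Movie => 'Film B', Quotecount => 4 },
    { ID => 1, Movie => 'Film A', Quotecount => 3 },
);

# <=> returns 0 when the word counts are equal, so || falls
# through to the second comparison and orders by ID instead
my @sorted = sort {
    $a->{Quotecount} <=> $b->{Quotecount}
        || $a->{ID} <=> $b->{ID}
} @records;

my @ids = map { $_->{ID} } @sorted;
print "@ids\n";    # prints "1 9 3"
```

IDs 1 and 9 share a word count of 3, so they are ordered by ID; ID 3 has the higher count and comes last.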

Line 61 prints a header. Line 62 prints the sorted array of hashes. We’ve opted for ID, Movie and the newly created Quotecount, and omitted Year, Rating and Quotes. We’ve utilised printf (%03d) to pad the word counts with leading zeros; only the 643-word quote from The Great Dictator is long enough not to need any.

ID  Movie                          WordsinQuote
15  Mary Poppins                   001
1   The Terminator                 003
9   Some randon film               003
10  Another random film            003
14  Dracula                        003
3   Predator                       004
11  Spinal Tap                     005
13  Raging Bull                    005
6   Con Air                        007
2   Terminator 2: Judgement Day    009
4   Kindergarten Cop               009
8   Kick-Ass                       010
7   Face/Off                       015
12  Snake Eyes                     018
5   Wild at Heart                  021
16  The Great Dictator             643

Until the next time.

Why you pushing me?

A question John Rambo asked, and those pushing him faced the consequences.

The push function in Perl isn’t as lethal as Rambo, and if you don’t use it, it won’t destroy your hometown. But there are instances when not using it could mess up your data.

Getting away with it – not pushing

Here’s a small file; 3 columns – an ID, band and lyric column.

1|Bowie|"Oh baby just you shut your mouth"
2|The Beatles|"and we are all together"
3|The Rolling Stones|"cold Italian pizza"
4|The Who|"the hypnotized never lie"

And here’s something very simple to parse it into a data structure

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dump qw(pp);

my %outerhash ;
my $file = "lyrics.psv";
open (my $fh,'<',$file) or die "Cannot open file: $!";
while (my $lines = <$fh>) {
  chomp $lines;
  my ($ID,$band,$lyric) = split (/\|/,$lines);
  
  $outerhash{$ID} = {
                     'band'  => $band,
                     'lyric' => $lyric
                    };
} 
close $fh;
print pp(\%outerhash);

Which produces

{
  1 => { band => "Bowie", lyric => "\"Oh baby just you shut your mouth\"" },
  2 => { band => "The Beatles", lyric => "\"and we are all together\"" },
  3 => { band => "The Rolling Stones", lyric => "\"cold Italian pizza\"" },
  4 => { band => "The Who", lyric => "\"the hypnotized never lie\"" },
}

Nothing spectacular going on. A hash of hashes has been created.

But our file has been updated – and Bowie appears twice, with the ID of 1.

1|Bowie|"Oh baby just you shut your mouth"
2|The Beatles|"and we are all together"
3|The Rolling Stones|"cold Italian pizza"
4|The Who|"the hypnotized never lie"
1|Bowie|"and we kissed, as though nothing could fall"

Carry On Regardless without using the push function

We parse the data again just as we did previously. And we get our hash of hashes again.

All hunky-dory (a clue in that for Bowie fans)?

{
  1 => { band => "Bowie", lyric => "\"and we kissed, as though nothing could fall\"" },
  2 => { band => "The Beatles", lyric => "\"and we are all together\"" },
  3 => { band => "The Rolling Stones", lyric => "\"cold Italian pizza\"" },
  4 => { band => "The Who", lyric => "\"the hypnotized never lie\"" },
}

No, not really. What’s happened to our entry with the lyric to Bowie’s China Girl (“Oh baby just you shut your mouth”)? It’s not been included in our structure. Hash keys are unique, so the second record with the ID of 1 has overwritten the first. Let’s try and fix that.
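
One way to spot this kind of silent loss is to check whether a key already exists before assigning to it. The sketch below is an illustration, not part of the original script: it uses placeholder records in a list instead of lyrics.psv, and the @overwritten array is an invented name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# placeholder records; two of them share the ID of 1
my @lines = (
    '1|Bowie|first lyric',
    '2|The Who|another lyric',
    '1|Bowie|second lyric',
);

my %outerhash;
my @overwritten;
for my $line (@lines) {
    my ( $ID, $band, $lyric ) = split( /\|/, $line );

    # note the ID if this assignment is about to replace an entry
    push( @overwritten, $ID ) if exists $outerhash{$ID};

    $outerhash{$ID} = { band => $band, lyric => $lyric };
}

print "$outerhash{1}{lyric}\n";             # prints "second lyric"
print "overwritten IDs: @overwritten\n";    # prints "overwritten IDs: 1"
```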

Using push – but getting it wrong

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dump qw(pp);

my %outerhash ;
my $file = "lyrics.psv";
open (my $fh,'<',$file) or die "Cannot open file: $!";
while (my $lines = <$fh>) {
  chomp $lines;
  my ($ID,$band,$lyric) = split (/\|/,$lines);
  push ( $outerhash{$ID} , {
	                         'band'  => $band,
	                         'lyric' => $lyric
	                        } ); 
} 
close $fh;
print pp(\%outerhash);

If we run it – we get this error.

Not an ARRAY reference at lyrics2.pl line 12, <$fh> line 1.

What’s happened? Man pages and perldoc pages can be sparse and confusing in detail, and running perldoc -f push on the command line yields scant information. But there is a salient line…

Treats ARRAY as a stack by appending the values of LIST to the
               end of ARRAY.

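What we’ve done is to use the push function incorrectly. push expects an array as its first argument, but we’ve handed it a hash value ($outerhash{$ID}), which is a scalar, and tried to push an anonymous hash ( { 'band' => $band, 'lyric' => $lyric } ) onto it.

What we should have done is dereference $outerhash{$ID} as an array by wrapping it in @{ }. Perl then creates an array reference under that hash key the first time it’s needed, and every record with the same ID is pushed onto that array. We fix that with the below.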

Using push – getting it correct

my %outerhash ;
my $file = "lyrics.psv";
open (my $fh,'<',$file) or die "Cannot open file: $!";
while (my $lines = <$fh>) {
  chomp $lines;
  my ($ID,$band,$lyric) = split (/\|/,$lines);
  push ( @{$outerhash{$ID}} , {
	                         'band'  => $band,
	                         'lyric' => $lyric
	                        } ); 
} 
close $fh;
print pp(\%outerhash);

Which produces…

{
  1 => [
         { band => "Bowie", lyric => "\"Oh baby just you shut your mouth\"" },
         {
           band  => "Bowie",
           lyric => "\"and we kissed, as though nothing could fall\"",
         },
       ],
  2 => [
         { band => "The Beatles", lyric => "\"and we are all together\"" },
       ],
  3 => [
         { band => "The Rolling Stones", lyric => "\"cold Italian pizza\"" },
       ],
  4 => [
         { band => "The Who", lyric => "\"the hypnotized never lie\"" },
       ],
}

And we can see that both Bowie entries are included as hashes within the same array.
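
Once the data is a hash of arrays of hashes, reading it back takes two loops: one over the IDs and one over the entries held under each ID. A rough sketch, assuming placeholder records in a list rather than lyrics.psv (the @report array is an invented name):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# placeholder records: ID, band, lyric
my @rows = (
    [ 1, 'Bowie',       'lyric one' ],
    [ 2, 'The Beatles', 'lyric two' ],
    [ 1, 'Bowie',       'lyric three' ],
);

my %outerhash;
for my $row (@rows) {
    my ( $ID, $band, $lyric ) = @{$row};

    # @{ } dereferences the hash value as an array, and Perl
    # creates the array reference on first use
    push( @{ $outerhash{$ID} }, { band => $band, lyric => $lyric } );
}

my @report;
for my $ID ( sort { $a <=> $b } keys %outerhash ) {

    # each ID holds an array of hashes, so loop over that too
    for my $entry ( @{ $outerhash{$ID} } ) {
        push( @report, "$ID $entry->{band}: $entry->{lyric}" );
    }
}
print "$_\n" for @report;
```

Both entries for ID 1 appear in the report, in the order they were read.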

Until the next time.

Perl Data – Thinking In Shapes

Data structures can be regarded as shapes. Yes, these shapes have their respective language naming conventions: hashes, dictionaries, arrays, lists, tuples etc. And when these shapes become nested… the names for them can sometimes become confusing.

Structure is shape. Shape is visual. Visual teaches more clearly than spoken or written word.

Here are some examples of creating shapes, and of how slight tweaks in the code can change the way data is displayed, and therefore the way information is presented and reported.

We’ll use the DATA filehandle, which reads whatever follows the __DATA__ token, with some fabricated data. If you want to test things with __DATA__, make sure it’s underneath everything else in the script.

To explain… the 1st column is a band name, the 2nd column an audio format, and the 3rd column the sales for that audio format.

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dump qw(dd);

my %hash;

while (<DATA>) {
 chomp;
 my @array = split (/,/,$_) ;
 $hash{$array[0]}->{band_count}++;
 $hash{$array[0]}->{sales_count} += $array[2];
}                                      
dd(\%hash);

__DATA__
alice in chains,mp3,1000
alice in chains,ogg,400
nirvana,mp3,10000
nirvana,flac,100
soundgarden,mp3,1000
soundgarden,wav,200
pearl jam,CD,10000
pearl jam,mp3,8000
pearl jam,mp3,6000

What are we doing? Line 8 reads in and loops over the data. Line 10 splits each line on a comma, turning the string into an array. Line 11… assigns the 1st column as the outer hash key, which ensures that each of the 4 grunge bands gets an inner hash to hold further data. But we’ve added (invented) an additional hash key – band_count and applied ++ to it. So what’s going on with this; $hash{$array[0]}->{band_count}++; ? And this… $hash{$array[0]}->{sales_count} += $array[2]; ?

Best explained with a visual.

{
  "alice in chains" => { band_count => 2, sales_count => 1400 },
  "nirvana"         => { band_count => 2, sales_count => 10100 },
  "pearl jam"       => { band_count => 3, sales_count => 24000 },
  "soundgarden"     => { band_count => 2, sales_count => 1200 },
}

So this; $hash{$array[0]}->{band_count}++; has created “alice in chains”=> { band_count => 2.

And this $hash{$array[0]}->{sales_count} += $array[2]; has created “alice in chains” => { … , sales_count => 1400 }, .

The number of times each band appears has been counted (++) and is represented as the hash value of band_count. We see Pearl Jam appears 3 times. The number of sales per band has been summed (+=) and appears as the value of sales_count.
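
Neither ++ nor += needs the key to exist beforehand: Perl creates the inner hash on first use and treats the missing value as 0. The sketch below repeats the counting on a few of the rows and then reads the shape back as a small report. It assumes the rows sit in a list instead of __DATA__, and the @by_sales array is an invented name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# a few of the rows from the post, held in a list for this sketch
my @lines = (
    'alice in chains,mp3,1000',
    'alice in chains,ogg,400',
    'nirvana,mp3,10000',
    'nirvana,flac,100',
);

my %hash;
for my $line (@lines) {
    my ( $band, $format, $sales ) = split( /,/, $line );
    $hash{$band}{band_count}++;             # a missing value starts at 0
    $hash{$band}{sales_count} += $sales;    # running total per band
}

# band names ordered by total sales, highest first
my @by_sales =
    sort { $hash{$b}{sales_count} <=> $hash{$a}{sales_count} } keys %hash;

printf( "%-16s %2d %6d\n", $_, $hash{$_}{band_count}, $hash{$_}{sales_count} )
    for @by_sales;
```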

With that template in mind, we can tweak the code to produce different results and different data shapes. This one’s not very relational, but worth an inclusion:

my %hash;

while (<DATA>) {
 chomp;
 my @array = split (/,/,$_) ;
 $hash{$array[1]}->{format_count}++;
}                                      
dd(\%hash);

Which produces how many times each format appears – and nothing else.

{
  CD   => { format_count => 1 },
  flac => { format_count => 1 },
  mp3  => { format_count => 5 },
  ogg  => { format_count => 1 },
  wav  => { format_count => 1 },
}

And just as one-dimensional:

my %hash;

while (<DATA>) {
 chomp;
 my @array = split (/,/,$_) ;
 $hash{$array[0]}->{band_appearance}++;
}                                      
dd(\%hash);

Which produces how many times each band appears – and nothing else.

{
  "alice in chains" => { band_appearance => 2 },
  "nirvana"         => { band_appearance => 2 },
  "pearl jam"       => { band_appearance => 3 },
  "soundgarden"     => { band_appearance => 2 },
}

Let’s return to something a bit more meaningful.

my %hash;

while (<DATA>) {
 chomp;
 my @array = split (/,/,$_) ;
 $hash{$array[0]}->{band_count}++;
 $hash{$array[0]}->{sales_count} += $array[2];
 $hash{$array[0]}->{format_count}++;
}                                      
dd(\%hash);

Here, we’ve taken the first example we used and added $hash{$array[0]}->{format_count}++; to it in order to include a count of formats. Notice everything ties back to $hash{$array[0]} as our outer hash key.

This produces

{
  "alice in chains" => { band_count => 2, format_count => 2, sales_count => 1400 },
  "nirvana"         => { band_count => 2, format_count => 2, sales_count => 10100 },
  "pearl jam"       => { band_count => 3, format_count => 3, sales_count => 24000 },
  "soundgarden"     => { band_count => 2, format_count => 2, sales_count => 1200 },
}

But something’s not quite right here. We wanted our format_count to be specific to each audio format. Let’s change $hash{$array[0]}->{format_count}++; .

We’ll experiment with 2 options. For the 1st option, we’ll add $hash{$array[0]}->{$array[1]}++; . Notice anything different about this? We’ve not invented a hash key name to hold the data. As a result, we’re not going to get a ‘format_count’ => { } in our data.

my %hash;

while (<DATA>) {
 chomp;
 my @array = split (/,/,$_) ;
 $hash{$array[0]}->{band_count}++;
 $hash{$array[0]}->{sales_count} += $array[2];
 $hash{$array[0]}->{$array[1]}++;
}                                      
dd(\%hash);

Which produces
{
  "alice in chains" => { band_count => 2, mp3 => 1, ogg => 1, sales_count => 1400 },
  "nirvana"         => { band_count => 2, flac => 1, mp3 => 1, sales_count => 10100 },
  "pearl jam"       => { band_count => 3, CD => 1, mp3 => 2, sales_count => 24000 },
  "soundgarden"     => { band_count => 2, mp3 => 1, sales_count => 1200, wav => 1 },
}

See how the audio formats are different? Because we haven’t given a hash key name, the audio format itself (mp3 etc. ) is the key name – the audio format count (++) is the value.

For the 2nd option, we’ll give each audio format its own format_count key and value. Previously we used $hash{$array[0]}->{format_count}++; , which only gave us an overall count of all formats rather than a count for each format. We’ll change it to $hash{$array[0]}->{$array[1]}->{format_count}++; .

my %hash;

while (<DATA>) {
 chomp;
 my @array = split (/,/,$_) ;
 $hash{$array[0]}->{band_count}++;
 $hash{$array[0]}->{sales_count} += $array[2];
 $hash{$array[0]}->{$array[1]}->{format_count}++;
}                                      
dd(\%hash);

Each of our audio formats is now an inner key pointing to its own hash, and each of those hashes has a format_count key holding the count (value) for its respective format.

{
  "alice in chains" => {
                         band_count => 2,
                         mp3 => { format_count => 1 },
                         ogg => { format_count => 1 },
                         sales_count => 1400,
                       },
  "nirvana"         => {
                         band_count => 2,
                         flac => { format_count => 1 },
                         mp3 => { format_count => 1 },
                         sales_count => 10100,
                       },
  "pearl jam"       => {
                         band_count => 3,
                         CD => { format_count => 1 },
                         mp3 => { format_count => 2 },
                         sales_count => 24000,
                       },
  "soundgarden"     => {
                         band_count => 2,
                         mp3 => { format_count => 1 },
                         sales_count => 1200,
                         wav => { format_count => 1 },
                       },
}
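
One consequence of this shape is that the values under each band are now mixed: band_count and sales_count are plain numbers, while each format is a hash reference. When reading the structure back, ref() can tell the two apart. A sketch under that assumption, with one band’s entry typed in by hand (the @formats array is an invented name):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# one band's entry, copied by hand from the output above
my %hash = (
    'pearl jam' => {
        band_count  => 3,
        sales_count => 24000,
        CD          => { format_count => 1 },
        mp3         => { format_count => 2 },
    },
);

my @formats;
for my $band ( sort keys %hash ) {
    for my $key ( sort keys %{ $hash{$band} } ) {
        my $value = $hash{$band}{$key};

        # formats are hash references; the two counters are plain scalars
        next unless ref($value) eq 'HASH';

        push( @formats, "$band $key $value->{format_count}" );
    }
}
print "$_\n" for @formats;
```

This prints one line per format and skips the two counters. Note that a format which happened to be called band_count or sales_count would clash with those keys, so keeping the formats under a dedicated inner key may be a safer shape.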

In this post, we’ve tweaked a line here and there, which has greatly altered the shape of the data. We’ve used the name of the band as the outer key – so everything is indexed to that. But we could have used the 2nd column, the audio format, as the outer key and built a different set of shapes from the same data.
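
As a rough idea of what indexing on the audio format might look like, here is a sketch that keys the outer hash on the 2nd column. It uses three of the rows held in a list instead of __DATA__; the %by_format hash and its bands key are invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# three of the rows from the post, held in a list for this sketch
my @lines = (
    'nirvana,mp3,10000',
    'pearl jam,CD,10000',
    'pearl jam,mp3,8000',
);

my %by_format;
for my $line (@lines) {
    my ( $band, $format, $sales ) = split( /,/, $line );
    $by_format{$format}{format_count}++;            # rows per format
    $by_format{$format}{sales_count} += $sales;     # sales per format
    $by_format{$format}{bands}{$band}++;            # bands seen per format
}

for my $format ( sort keys %by_format ) {
    my @bands = sort keys %{ $by_format{$format}{bands} };
    printf( "%-4s %d %6d %s\n",
        $format,
        $by_format{$format}{format_count},
        $by_format{$format}{sales_count},
        join( ', ', @bands ) );
}
```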

Until the next time.