Adventures in Perl part 4

Delete your duplicates

Welcome to part 4.

In the previous posts we looked at how to inspect the validity of a CSV file, how to determine whether it fits nicely in a spreadsheet, and how to check whether it will create more columns than needed. We also wrote some simple conditions to check year and rating. We then determined we had duplicates based on an ID column, and decided that some were genuine duplicates, and some were not.

We have an updated CSV file for this post. Snake Eyes has a new ID of 12 (it was ID 2 – making it a duplicate ID). We have two new entries for ID 14 (a duplicate?), and an interesting entry of ID 13. What makes this quote different to all the other quotes is that it’s a quote quoting a quote. Eh? OK, it’s a Robert De Niro quote, but his quote is actually quoting Marlon Brando.
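For reference, the standard CSV convention is to escape a double-quote inside a quoted field by doubling it. So the ID 13 line plausibly looks something like this in the raw file (my reconstruction, not a dump of the actual file):

13,Raging Bull,1980,5,"""I could've been a contender"""

The outer pair of quotes encapsulates the field; each doubled pair inside becomes a single literal quote when parsed.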

Our updated CSV file looks like this

Now to delete duplicates.

Firstly, as the file is (intentionally) small, we can see duplicates in the ID column. We can also see that ID 1, which appears twice, has the same movie name, release year, rating and quote. Duplicate ID 14 has the same movie name, a different year and rating, but the same quote.

What to do? Let’s make the assumption that if the ID and Year columns are the same – it’s a duplicate, and delete one of them. And likewise, if the ID and Year columns are not the same, it stays. We’re going to create a new ID that merges ID and Year to do this in the second example.

As per previous posts, we loop through the data file… but we create a different data structure: a hash of hashes (or rather, a hash of hash references). Since hash keys must be unique, this will prove useful.

A quick demonstration of what happens when the data is transformed into a hash of hashes.

#!/usr/bin/perl

# Arnie_Nic.pl to parse Arnie_Nic.csv
# checking for duplicates

use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(pp);
use feature 'say';
use Carp;

my $filename = 'Arnie_Nic.csv';

open( my $fh, '<', $filename ) or die "Can't open file: $!";
my $csv = Text::CSV->new( { sep_char   => q{,},
                            quote_char => q{"},
                          } );
croak if Text::CSV->error_diag();

#############################################
# using getline_hr - build a hash of hashes #
#############################################

my $header = $csv->getline($fh);
$csv->column_names( @{$header} );

# create the hash of hashes

my %hash;
while ( my $inner_hash = $csv->getline_hr($fh) ) {

  # using map

  $hash{$inner_hash->{'ID'}} =
    { map {$_ => $inner_hash->{$_} } keys %{$inner_hash} };
}
close $fh;
$csv->eof or $csv->error_diag();
say pp (\%hash);

# how many keys (IDs) do we have
say "There are ", scalar keys %hash , " IDs";
exit;

The Text::CSV->new call creates the $csv object and declares its options. The getline and column_names calls work in tandem to grab the header row and register the column names, which the while loop then uses via getline_hr to return each record as a hash reference keyed by column name. Just before the loop we declare the hash (%hash) which will hold the data structure being created inside it.

There’s a bit happening inside the loop. We create an outer hash key using the data in the ID column, and map the inner hash (reference) to this outer key. After the loop we close the file, and report any errors in doing so.

We then print the data structure with pp, count the number of outer keys (IDs), and print the result.

Note, there is only 1 entry for ID 1 – it takes the last ID 1 in the file. There’s also only one entry for ID 14 – again taking the last entry of this ID in the file. It also prints that there are 14 keys, which can’t be correct as we have 16 lines of data (header row is not included) in the file. But as mentioned – hash keys must be unique.
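A minimal demonstration of that “last one wins” behaviour, outside of any CSV handling:

use strict;
use warnings;
use feature 'say';

my %hash;
$hash{1} = 'first ID 1 row';
$hash{1} = 'last ID 1 row';    # same key - overwrites the previous value
say $hash{1};                  # prints: last ID 1 row
say scalar keys %hash;         # prints: 1

Assigning to the same key twice simply replaces the value, which is exactly why our 16 data lines collapse into 14 keys.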

Output shortened for this example

{
  1  => {
          ID => 1,
          Movie => "The Terminator",
          Quotes => "I'll be back",
          Rating => 5,
          Year => 1984,
        },
  2  => {
          ID => 2,
          Movie => "Terminator 2: Judgement Day",
          Quotes => "I need your clothes, your boots and your motorcycle",
          Rating => 5,
          Year => 1991,
        },
 ...
  12 => {
          ID => 12,
          Movie => "Snake Eyes",
          Quotes => "I saw you and you saw me, don't pretend like you don't know who I am girly man",
          Rating => 3,
          Year => 1998,
        },
  13 => {
          ID => 13,
          Movie => "Raging Bull",
          Quotes => "\"I could've been a contender\"",
          Rating => 5,
          Year => 1980,
        },
  14 => {
          ID => 14,
          Movie => "Dracula",
          Quotes => "I am Dracula",
          Rating => 3,
          Year => 1931,
        },
}
There are 14 IDs

However, it’s a good start. We just need to tweak a few things. We want to keep both ID 14 entries from the CSV file. Let’s create a new ID, and assign (not using map) everything to that. Having done that, if the new ID still proves to be a duplicate, we’ll delete it. If not, it stays.

Our new ID is the concatenated data from the ID and Year columns.

#!/usr/bin/perl

# Arnie_Nic.pl to parse Arnie_Nic.csv
# checking for duplicates

use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(pp);
use feature 'say';
use Carp;

my $filename = 'Arnie_Nic.csv';

open( my $fh, '<', $filename ) or die "Can't open file: $!";
my $csv = Text::CSV->new( { sep_char   => q{,},
                            quote_char => q{"},
                          } );
croak if Text::CSV->error_diag();

#############################################
# using getline_hr - build a hash of hashes #
#############################################

my $header = $csv->getline($fh);
$csv->column_names( @{$header} );

# create the hash of hashes

my %hash;
my %dups_deleted;
while ( my $inner_hash = $csv->getline_hr($fh) ) {

    # create a new outer key ID
    my $outer_key = join( '-', $inner_hash->{ID},
                               $inner_hash->{Year},
                        );

    # get the order correct... we're in a loop
    # assign to %dups_deleted as the %hash is being built
    # delete comes first
    # duplicates based on new ID

    $dups_deleted{$outer_key} = delete $hash{$outer_key}
        if exists $hash{$outer_key};

    # not using map
    # unique based on new ID
    $hash{$outer_key} = $inner_hash;
}
close $fh;
$csv->eof or $csv->error_diag();

# how many keys (new IDs) do we have
say "There are ", scalar keys %hash , " unique IDs";
say "They are";
say pp (\%hash);

# how many keys (new IDs) do we have in the duplicates data
say "There are ", scalar keys %dups_deleted , " IDs deleted from the main hash";
say "They are";
say pp (\%dups_deleted);
exit;

To outline the differences in this version: we have an additional hash, %dups_deleted, to hold any newly created ID that turns out to be a duplicate. We create this new ID (a combination of ID and Year joined with a “-” symbol) inside the loop. This means that our IDs of 1 with the same year can be treated as duplicates, and our IDs of 14 with different years are not considered duplicates.

The delete line might seem confusing, not only in its syntax but in its position before the assignment that builds the main hash. Effectively it states that if the new ID has been seen before (i.e. if it already exists in %hash), the ID and its data are deleted from %hash and go into the %dups_deleted hash. Which means if it hasn’t been seen before (i.e. if it is unique), the data stays in %hash.

If you put the assignment to %hash before the delete line, %hash will end up empty, and %dups_deleted will hold everything.
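A minimal sketch of that wrong ordering, using plain strings instead of our inner hash references:

use strict;
use warnings;

my ( %hash, %dups_deleted );
for my $outer_key ( '1-1984', '1-1984' ) {
    # WRONG ORDER: assign first...
    $hash{$outer_key} = 'row data';
    # ...then delete-if-exists - which is now always true,
    # so every row is immediately moved out again
    $dups_deleted{$outer_key} = delete $hash{$outer_key}
        if exists $hash{$outer_key};
}
# %hash is now empty; %dups_deleted holds everything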

The plain assignment populates %hash with everything that hasn’t been picked up and put in %dups_deleted. It replaces the map function that was used in the previous example… just to prove there is more than one way to do it in Perl.

The say and pp calls at the end report how many keys are in each hash, and display the results.

Contents of unique IDs, shortened for this example

There are 15 unique IDs
They are
{
  "1-1984"  => {
                 ID => 1,
                 Movie => "The Terminator",
                 Quotes => "I'll be back",
                 Rating => 5,
                 Year => 1984,
               },
...
  "14-1931" => {
                 ID => 14,
                 Movie => "Dracula",
                 Quotes => "I am Dracula",
                 Rating => 3,
                 Year => 1931,
               },
  "14-1958" => {
                 ID => 14,
                 Movie => "Dracula",
                 Quotes => "I am Dracula",
                 Rating => 4,
                 Year => 1958,
               },
}

We can see that %hash keeps one entry for ID 1 (updated as 1-1984), and keeps both entries for ID 14 with their new IDs of 14-1931 and 14-1958 as their updated ID means they are no longer duplicates.

Contents of duplicate IDs

There are 1 IDs deleted from the main hash
They are
{
  "1-1984" => {
    ID => 1,
    Movie => "The Terminator",
    Quotes => "I'll be back",
    Rating => 5,
    Year => 1984,
  },
}

This makes sense: all 16 lines of data are accounted for. 15 newly created IDs (keys) are unique, and 1 (one of the 1-1984 entries) is a duplicate. We chose a hash (or rather, a hash of hash references) to achieve this exercise.

Modern Perl is a viable option to re-shape and wrangle data sets.

Until the next time.

Adventures in Perl part 3

Data duplication or not duplication… that is the question

In parts 1 and 2 we looked at messy CSV data and how to identify it with Perl. We’ve been using a small CSV file so we can actually see its entire contents; the good, the bad, and the ugly lines – and how Perl can be used to flag inconsistencies.

In this post we look at potential duplication. Our file has some additional lines since the previous two posts; all quoting is correct, no dodgy formatting, all hunky-dory in that respect. In this post, Nicolas Cage has joined Arnold Schwarzenegger as the King of Movie Quotes.

In order to establish possible duplication, we need to decide which column to use. In this file it might seem obvious which column’s data is to be scrutinised for duplication. But what if there were a sixth column in this file that contained an IMDB ID? Surely that would be a “key” worthy of querying. Every IMDB key has to be unique, right?

We wouldn’t be so naive as to consider the Year column for the duplication check. Of course not – calculate how many films in each year, yes – but common sense tells us you can have many films made/released in the same year. The same principle applies to the Rating column.
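As an aside, counting films per year is a one-hash job. A minimal sketch, assuming the rows have already been parsed into array references of [ID, Movie, Year, Rating, Quotes]:

use strict;
use warnings;
use feature 'say';

# hypothetical pre-parsed rows from the CSV
my @rows = ( [ 1,  'The Terminator', 1984, 5, "I'll be back" ],
             [ 11, 'Spinal Tap',     1984, 5, "well, it's one louder, isn't it?" ],
             [ 3,  'Predator',       1987, 4, 'Get to the chopper!' ],
           );

my %films_per_year;
$films_per_year{ $_->[2] }++ for @rows;    # Year is the third column

say "$_: $films_per_year{$_} film(s)" for sort keys %films_per_year;

That prints 1984: 2 film(s) and 1987: 1 film(s) – a count per year, not a duplication check.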

But, as you might have guessed, the ID column is the unique key to consider whether duplication has occurred, where each entry (aka record) is meant to have a unique numeric value.

Here we utilise the Text::CSV module more than in parts 1 and 2 by using the getline, column_names and getline_hr features to use the header row rather than skipping it, and by referring to column names rather than array indices. In doing this, we’re creating outer hash keys that hold arrays, whose elements are in turn hashes of keys and values. This can be considered a hash of arrays of hashes. Strictly speaking, the outer hash keys hold array references, and those arrays hold hash references. Further reading links on this are at the end of the post.

This creates the data structure

#!/usr/bin/perl

# Arnie_Nic.pl to parse Arnie_Nic.csv
# checking for duplicates

use strict;
use warnings;
use Text::CSV;
use Data::Dump qw(pp);
use feature 'say';

my $csv = Text::CSV->new( {sep_char => q{,}, quote_char => q{"} } );
my $file = 'Arnie_Nic.csv';

# we need an Arnie_Nic.csv file to continue
open (my $fh,'<',$file) or die "Can't open file: $!";

######################################################
# using getline_hr - build a hash of array of hashes #
######################################################

my %hash;

# grab the header to use

my $header = $csv->getline($fh); 
$csv->column_names( @{$header} );

while ( my $lines = $csv->getline_hr($fh) ) {

    # getline_hr returns a hash reference - no chomp needed

    # print columns if you want to
    # say "$lines->{ID} $lines->{Year} $lines->{Movie}";

    # create the data structure
    push( @{ $hash{ $lines->{ID} } }, {
                                        'Year'   => $lines->{Year},
                                        'Movie'  => $lines->{Movie},
                                        'Rating' => $lines->{Rating},
                                        'Quotes' => $lines->{Quotes},
                                      } );
}
$csv->eof or $csv->error_diag();
close $fh or die "Cannot close $file";
say pp (\%hash);
exit;

We start by declaring the intended outer structure as a hash (%hash). We need some keys that hold some values.

The getline and column_names calls work together to take the first line of the CSV file and tell getline_hr to use the column names as the hash keys.

The commented-out say line emphasises that whilst in the while loop we can print the contents of the CSV file just as we did in parts 1 and 2. We could put an if condition here if we needed to.

The push is doing the grunt work. It instructs Perl to push the Year, Movie, Rating and Quotes data onto its corresponding ID as the data is being looped over.

We close the while loop, and print the results with Data::Dump. It displays what is obvious with such a small file: ID 1 contains two array elements (which are hashes within the array), with matching data for all columns – clearly a duplicate.

ID 2 also contains two hashes as array elements… clearly not a duplicate. More than likely, given that our movie Snake Eyes comes after ID 11 in the CSV file, the ID for this movie should be 12 and not 2.

{
  1  => [
          {
            Movie  => "The Terminator",
            Quotes => "I'll be back",
            Rating => 5,
            Year   => 1984,
          },
          {
            Movie  => "The Terminator",
            Quotes => "I'll be back",
            Rating => 5,
            Year   => 1984,
          },
        ],
  2  => [
          {
            Movie  => "Terminator 2: Judgement Day",
            Quotes => "I need your clothes, your boots and your motorcycle",
            Rating => 5,
            Year   => 1991,
          },
          {
            Movie  => "Snake Eyes",
            Quotes => "I saw you and you saw me, don't pretend like you don't know who I am girly man",
            Rating => 3,
            Year   => 1998,
          },
        ],
  3  => [
          {
            Movie  => "Predator",
            Quotes => "Get to the chopper!",
            Rating => 4,
            Year   => 1987,
          },
        ],
  4  => [
          {
            Movie  => "Kindergarten Cop",
            Quotes => "I'm a cop, you idiot! I'm Detective John Kimble!",
            Rating => 3,
            Year   => 1990,
          },
        ],
  5  => [
          {
            Movie  => "Wild at Heart",
            Quotes => "Did I ever tell ya that this here jacket represents a symbol of my individuality, and my belief in personal freedom?",
            Rating => 5,
            Year   => 1990,
          },
        ],
  6  => [
          {
            Movie  => "Con Air",
            Quotes => "Put... the bunny... back... in the box.",
            Rating => 4,
            Year   => 1997,
          },
        ],
  7  => [
          {
            Movie  => "Face/Off",
            Quotes => "You'll be seeing a lot of changes around here. Papa's got a brand new bag.",
            Rating => 1,
            Year   => 1997,
          },
        ],
  8  => [
          {
            Movie  => "Kick-Ass",
            Quotes => "Tool up, honey bunny. It's time to get bad guys.",
            Rating => 4,
            Year   => 2010,
          },
        ],
  9  => [
          {
            Movie  => "Some randon film",
            Quotes => "Some random quote.",
            Rating => 1,
            Year   => 2000,
          },
        ],
  10 => [
          {
            Movie  => "Another random film",
            Quotes => "Another random quote.",
            Rating => 1,
            Year   => 2001,
          },
        ],
  11 => [
          {
            Movie  => "Spinal Tap",
            Quotes => "well,it's one louder, isn't it?",
            Rating => 5,
            Year   => 1984,
          },
        ],
}

This is all well and good – but to most eyes the evidence is not very readable. We need to add something that makes the analysis easy to understand.

$csv->eof or $csv->error_diag();
close $fh or die "Cannot close $file";
say pp (\%hash);

# added lines under here...

say "These IDs appear more than once...";
# loop through outer hash keys (ID)
foreach my $id ( keys %hash ) {

    # count how many elements in each array
    # i.e. how many hashes in each array
    my $duplicate_check = scalar @{ $hash{$id} };

    # get at the inner hashes inside the array
    foreach my $inner_hash ( @{ $hash{$id} } ) {

        # print some details only if the array holds
        # more than one hash
        say "ID $id $inner_hash->{Year} $inner_hash->{Movie}"
            if $duplicate_check > 1;
    }
}
exit;

The first say simply prints a heading.

We then loop through the outer hash keys (aka the IDs from the CSV file).

Inside that loop, a count is assigned to the variable $duplicate_check. It calculates how many elements are in the array – and because the elements are hashes, it is counting how many hashes sit in each array per outer hash key (ID).

The inner loop then goes over the array to get at the inner hashes.

The final say prints them – we choose to print the ID (which has already been assigned to $id by the outer loop), plus Year and Movie. But it prints this only if the value of $duplicate_check is greater than 1.

This gives us the more understandable output…

These IDs appear more than once...
ID 1 1984 The Terminator
ID 1 1984 The Terminator
ID 2 1991 Terminator 2: Judgement Day
ID 2 1998 Snake Eyes

Try it on a larger file – it works just as well.

That’s it for now. Next time – how to change the IDs from the existing data – and some code to delete it based on certain conditions.

Further reading of Perl data structures

Maybe a little complex for a first look, but some great templates used https://perldoc.perl.org/perldsc.html

Modern Perl by Chromatic provides clear snippets of arrays, hashes and nested assortments of these, as well as many other topics http://www.onyxneon.com/books/modern_perl/index.html. It’s free as a PDF.

Data Munging with Perl by Dave Cross is free as a PDF. Chapter 2 is very clear and concise about how to shape data from files. https://perlhacks.com/dmp.pdf.

If you want a gentle introduction to Perl, Dave Cross has also published A Perl Taster – available here https://leanpub.com/perl-taster or here https://www.amazon.com/Perl-Taster-Your-First-Hours-ebook/dp/B077MB1LH1.

A beast of a book is Beginning Perl by Curtis “Ovid” Poe. It’s well structured, gets progressively more difficult, but the basics are clearly explained. Throughout, there is plenty of explanation to the code examples. It’s available here https://www.amazon.co.uk/Beginning-Perl-Curtis-Poe-ebook/dp/B009K92D9I.

I’m not affiliated with any of the above; I include them purely as recommendations for anyone wanting to work with Perl.

Adventures in Perl part 2

The CSV file – Trust No One

That’s right. Do as Fox Mulder from The X-Files would do.

When you are handed a CSV file, don’t trust it. I once merrily loaded sports data into a system taking it at face value, only to discover that an individual whose swim lap times could rival Michael Phelps’s was 100 years old. Alien DNA or dubious data?

Taking our ‘Arnie.csv’ file from Adventures in Perl part 1, we’ve intentionally added a few more lines to help with the exercise. Note the Quotes column on the last 3 lines… each phrase contains commas – but the quote is not encapsulated in quotes.

What do we already know? Well, we want 5 columns: ID, Movie, Year, Rating and Quotes. We’ve learnt that by default, without any advanced handling, unquoted text containing a comma is likely to create additional, unwanted columns… as the comma acts as the delimiter.

We also know that missing quote characters on lines of data can create concatenation – joining separate records together (see ID 2 and ID 3 in Adventures in Perl part 1). We flagged dodgy structured CSV data with Perl’s Text::CSV module (https://metacpan.org/pod/Text::CSV) and the parse feature.

We left off by establishing that additional, unwanted columns can’t really be classified as dodgy structured CSV. It’s really the user’s job to know what is and what isn’t needed – and deal with it accordingly.

The same can be said for the actual content of the data. What to do with it will depend on the project, clients’ requirements etc. Can it be ignored? If it’s clearly incorrect, should it be deleted? Should it be corrected?

For example, we have a Year column which contains a 4-digit figure representing the release date of the movie. How would we ensure that all the data in this column stays within some sensible caveats? If we assume that movies have only been made in the nineteenth, twentieth and twenty-first centuries, how could we flag anything that didn’t comply? And how would we flag anything that doesn’t conform to 4 digits; i.e. we don’t want any ’90 or ’93 entries.

And for the rating, if the lowest is 1, and the highest is 5, how can we flag any rating that doesn’t conform to this range?

#!/usr/bin/perl

# Arnie.pl to parse Arnie.csv
# checking for csv format errors

use strict;
use warnings;
use Text::CSV;
use feature 'say';
 
my $csv = Text::CSV->new( {sep_char => q{,}, quote_char => q{"} } );
my $file = 'Arnie.csv';
 
# we need an Arnie.csv file to continue
open (my $fh,'<',$file) or die "Can't open file: $!";
 
# remove the header row
# we don't want it included in if conditions
 
my $header = <$fh>;
while (my $lines = <$fh>) {
    chomp $lines;
     
    # deal with the lines that do parse OK
    if ( $csv->parse($lines) ) {
        my @arr = $csv->fields();

        # don't bother with non-capturing in the parenthesis
        # flag the Year column if it isn't a 4-digit year
        # starting with 18, 19 or 20
        if ( $arr[2] !~ /^(1[89]|20)\d\d$/ ) {
            say "ID $arr[0] $arr[1] has date $arr[2] which looks odd";
        }

        # flag a Rating that isn't a whole number from 1 to 5
        my $rating = qr/^[1-5]$/;
        if ( $arr[3] !~ /$rating/ ) {
            say "ID $arr[0] has wrong rating $arr[3]";
        }

        # check for spreadsheet import issues
        # if more than the allocated 5 columns
        if ( scalar @arr > 5 ) {
            # assign 6th to end column to a slice
            my @slice = @arr[5 .. $#arr];
            say '';
            say "This entry creates a spreadsheet split";
            say "Quote is split over more than one column: ", @slice;
            say $csv->string;
        }
    }
    # the lines that don't parse OK
    else {
        warn "Problem parsing line $.  error ", $csv->error_diag();
    }
}
close $fh;
exit;

Again, we’re not utilising Text::CSV too much here, and we’re taking some liberties with the year check. As it stands, there’s nothing flagging a film “released” in 1800, which would be too early. There’s also nothing that would flag a film released at any future date up to 2099.
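If we wanted to close those gaps, a numeric range check reads more naturally than stretching the regex. A hedged sketch, slotting into the same loop in place of the year condition – the 1888 lower bound (around the earliest surviving motion pictures) is my assumption, and we allow the current year plus one for announced releases:

# a stricter alternative to the year regex
my $this_year = (localtime)[5] + 1900;
if ( $arr[2] !~ /^\d{4}$/
     or $arr[2] < 1888
     or $arr[2] > $this_year + 1 ) {
    say "ID $arr[0] $arr[1] has date $arr[2] which looks odd";
}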

We have 3 embedded if conditions within the outer parse condition, flagging Year, Rating and potential unwanted columns. So if the line is parsed AND the year condition is met – it’s flagged. The same pattern applies for parsing the line AND checking the rating, and for parsing the line AND checking for columns after the Quotes column.

This produces…

Note that ID 9 has been flagged both on year and rating, as well as being flagged on data that appears after the Quotes column.

ID 9 and ID 10 would seem to have data in columns after the Quotes column, that is superfluous to requirements, if we take “unwanted column” and “another unwanted column” as not being part of a movie quote.

But, ID 11 looks as if it has a genuine movie quote which has created additional columns after the Quotes column – as it has commas and isn’t surrounded in quotes. This would span across columns in a spreadsheet rather than be grouped in our Quotes column.

Again, all quite easy to spot and identify in such a small CSV file – we can tell just by looking at it. But if this were a larger CSV file and the task had to be repeated, flagging anomalies in this manner is invaluable.

In part 3, we’ll look into data structures, and what they can offer.

Adventures in Perl part 1

The CSV file – quotes and terminators – does my spreadsheet look good?

There’s no getting away from the CSV file; readable in text editors and spreadsheets on Mac, Windows and Linux. Spreadsheets being the most popular method to read CSV files is hardly surprising. Open up Excel or LibreOffice Calc and we get tidy, formatted data.

Let’s look at the contents of “Arnie.csv” in a spreadsheet; 5 columns – ID, Movie, Year, Rating, Quotes. Looks O.K. right?

No, it doesn’t. It’s a complete mess. ID 2 data also contains all the data for ID 3. And we seem to have an extra, unwanted column with half the quote for ID 4. In other words – the quote has split in half.

To understand the reason why the spreadsheet looks inaccurate, we look at “Arnie.csv” in a text editor. Any will suffice; even a non-fancy one such as Notepad on Windows. We also need to take into account that a comma “,” (traditionally) separates columns in a CSV file. So, this applies to the ID, Movie, Year, Rating, and Quotes columns.

But the file’s appearance in the text editor looks nothing like how it appears in our spreadsheet data. Yes, it’s ugly. But we can clearly see that each film’s data is on a separate line. ID 2 and ID 3 are NOT concatenated on one line. If we go through the quotes it might give some insight to the behaviour of the spreadsheet.

ID 1 – “I’ll be back”. Surrounded in double-quotes. Renders fine in the spreadsheet. Technically – it doesn’t need to be in quotes as there is no comma in the quote. More on this for ID 4.

ID 2 – “I need your clothes, your boots and your motorcycle. We have a missing closing double-quote after motorcycle. If we then look at ID 3 – Get to the chopper” – we have a missing opening double-quote before Get.

ID 4 – I’m a cop, you idiot! I’m Detective John Kimble! – no quotes at all. There’s a comma after I’m a cop and the quote continues; you idiot! I’m Detective John Kimble!

Quotes are essential in CSV files when structuring data. Without them, a comma that isn’t intended to separate columns will do exactly that – separate columns. This is why the fragment you idiot! I’m Detective John Kimble! appears in an extra, unwanted column after the Quotes column. We have an unquoted comma in this Arnie speech that is not intended to act as a delimiter.

Usually, when going the other way – having the original data in a spreadsheet, and saving it as CSV, the quoting of data is done for you, and the quote would appear as “I’m a cop, you idiot! I’m Detective John Kimble!” in the CSV file.
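Text::CSV can do that quoting for us when we are the ones writing the data. A minimal sketch using its combine and string methods (always_quote forces quotes around every field, which is more than strictly needed but harmless):

use strict;
use warnings;
use Text::CSV;
use feature 'say';

my $csv = Text::CSV->new( { always_quote => 1 } );
$csv->combine( 4, 'Kindergarten Cop', 1990, 3,
               "I'm a cop, you idiot! I'm Detective John Kimble!" );
say $csv->string;
# "4","Kindergarten Cop","1990","3","I'm a cop, you idiot! I'm Detective John Kimble!"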

A manual fix

Since it’s only 4 lines of data (1st line is a header row), let’s manually adjust the CSV file and see if that makes a difference to the appearance in a spreadsheet.

We’ve ensured each quote is surrounded with double-quotes. How has this affected the way the spreadsheet appears?

That’s more like it. The spreadsheet looks good – everything is where it should be.

A quick CSV parser

But, there are only 4 lines of data. What if there were 40,000? It would be impossible to spot all poorly structured data. We couldn’t cross-reference every line in the spreadsheet and text editor in order to find anything that might need fixing.

What’s needed is a tool to parse CSV. Something to report where the CSV data might not be formatted correctly. We’ll use Perl and the Text::CSV module (https://metacpan.org/pod/Text::CSV). This module has a whole bag of tricks, but for this post, we just want to write some Perl that will flag what it considers as dodgy CSV. For that we are using the ‘parse’ feature of the module.

#!/usr/bin/perl

# Arnie.pl to parse Arnie.csv
# checking for csv format errors

use strict;
use warnings;
use Text::CSV;

# allow for double-quotes...
# at this stage we don't know if Quotes column has them

my $csv = Text::CSV->new( {sep_char => q{,}, quote_char => q{"} } );
my $file = 'Arnie.csv';

# we need an Arnie.csv file to continue
open (my $fh,'<',$file) or die "Can't open file: $!";
while (my $line = <$fh>) {

    # 'if not' sounds more human than 'unless'
    if ( !$csv->parse($line) ) {
        warn "Problem parsing line $.  error ", $csv->error_diag();
    }
}
close $fh;
exit;

The use Text::CSV line tells the script we are using the Text::CSV module. The Text::CSV->new call declares $csv as an object, with the options that are going to be used with it.

Two things of note: you can name the object anything you want; it doesn’t have to be $csv. But this naming tends to be standard practice in the Perl world, so we’ll obey the rules. I’m also declaring the delimiter (sep_char) explicitly as a comma. This may seem odd when the module is called Text::CSV and we are working with a CSV file. But Text::CSV can be used for tab-delimited files, or any other delimiter; so it’s good practice to be verbose about things.
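For example, pointing the module at a tab-delimited file would just be a change of sep_char – a hypothetical one-liner:

my $tsv = Text::CSV->new( { sep_char => "\t", quote_char => q{"} } );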

We then declare the name of the CSV file we want to read, open it for reading, and assign it to a file handle ($fh). If the file cannot be read, the script complains and doesn’t go any further. Much like $csv… you can call this file handle what you want (within reason). Don’t call it $out, for example. That would suggest you are sending data out to a file from the script, rather than reading it into the script.

The while loop reads ‘Arnie.csv’ line by line. The if condition reads as: “if this line of CSV data cannot be parsed”. The warn reads as: “then complain about why it can’t be parsed, giving the line number, and the reason why you are complaining.” We then close the file and exit the script.

And our output.

Problem parsing line 3  error 2021EIQ - NL char inside quotes, binary off9035 at Arnie.pl line 21, <$fh> line 3.
Problem parsing line 4  error 2034EIF - Loose unescaped quote3845 at Arnie.pl line 21, <$fh> line 4.

We can see (much as we could manually) that there are quote issues with line 3 and line 4 (which are ID 2 and ID 3). If ‘Arnie.csv’ were a larger file, all potential errors would be flagged. An important note: it has not flagged the split quote for ID 4. There is no error message about the extra column produced for I’m a cop, you idiot… Technically, that is not dodgy CSV.

There’s no getting away from having to fix your data if you want accurate results. We can cheat to a certain degree by allowing loose_quote options. Visit the Text::CSV CPAN page if you really want to venture into this potential nightmare (https://metacpan.org/pod/Text::CSV). But, if you send bad data in – you’ll get bad data out. So, best fix it if the data is intended for further reporting, charts, graphs etc.
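For the curious, the relevant option is allow_loose_quotes. A minimal sketch – shown for completeness, not as a recommendation:

my $lenient = Text::CSV->new( { sep_char           => q{,},
                                quote_char         => q{"},
                                allow_loose_quotes => 1,
                              } );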

In Adventures in Perl part 2 , we’ll look at how to write a manual fix to pick up any extra, unwanted columns. Plus, ensure that our Year and Rating columns only allow for dates and scores.

Analogue Waves Ambient Space Music On YouTube

I’ve started a YouTube channel, where you can view the first video here

https://www.youtube.com/watch?v=-3cvq7hC2rk

Please click the like button if it’s to your liking.

There will be more coming shortly… so don’t forget to subscribe once the content starts flowing.

Until the next time.

Concert Tickets From The 80s and 90s

Some of these hail from the days when your ticket was ripped in two, much like a bus ticket was.

Top right belongs to U2‘s Joshua Tree tour 1987. If memory serves, for I was wee then; The Pogues were supporting. This was July 1987. A month later they would record a little Christmas song called Fairytale of New York.