Sorting with Perl

I recently had to sort a rather large file for a client of mine using a Perl script. The output had to contain each line of the input file exactly once. The client was running Windows, and the DOS sort command did not seem to have a -u option like the one sort has on Linux.

Since the client already had Perl installed (for the other script), I decided to write a sort-by-unique script in Perl. I was quite surprised by the results. The file in question was 326MB* (a pipe-delimited scrape of the business listings on Yellow Pages). Using time cat listings.csv | sort -u >> test.csv took approximately 6 minutes and 30 seconds. Sorting the same file with my Perl script took approximately 10 seconds.

The Linux sort command is written in C. I find this interesting, as C is generally much faster than Perl (although design matters far more for optimization than the speed of the language). Since my little script obviously isn't the result of some ingenious design, I think what this best illustrates is that certain languages are best for certain jobs because of their built-in data structures. Some data structures are simply better suited to certain jobs and allow for simpler algorithms. Here, the hash collapses duplicate lines as the file is read, so the script only has to sort the distinct lines that remain.

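#!/usr/bin/perl

use strict;
use warnings;

my ($input, $output) = @ARGV;

my %uniques = ();

open(my $in, '<', $input) or die "Cannot open input file $input: $!\n";

# Append mode: an existing output file is added to, not overwritten.
open(my $out, '>>', $output) or die "Cannot open output file $output: $!\n";

# Each distinct line becomes one hash key, so duplicates collapse here.
while (my $line = <$in>)
{ $uniques{$line} = $line; }

close($in);

# Write the distinct lines in ascending string order.
foreach my $key (sort (keys %uniques))
{ print $out $key; }

close($out);

# Descending string comparator. Nothing above calls it; to use it,
# write: sort ksort keys %uniques
sub ksort
{ $uniques{$b} cmp $uniques{$a}; }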
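
To see why loading the lines into a hash first can pay off, here is a minimal sketch on made-up data. It is not the script used for the client: the helper names unique_sorted and count_comparisons and the sample lines are hypothetical, chosen only for illustration. It counts how many comparisons Perl's sort makes on the full list and on the distinct lines alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the distinct lines in ascending string order.
sub unique_sorted {
    my @lines = @_;
    my %seen;
    $seen{$_} = 1 for @lines;    # duplicate lines land on the same key
    return sort keys %seen;      # only the distinct lines get sorted
}

# Hypothetical helper: count the comparisons a plain sort performs.
sub count_comparisons {
    my @items = @_;
    my $count = 0;
    my @sorted = sort { $count++; $a cmp $b } @items;
    return $count;
}

# Made-up sample data: 3 distinct lines, repeated 1000 times.
my @sample = ("beta|2\n", "alpha|1\n", "gamma|3\n");
my @lines  = (@sample) x 1000;

my @unique = unique_sorted(@lines);
print @unique;

printf "comparisons on all %d lines: %d\n",
    scalar @lines, count_comparisons(@lines);
printf "comparisons on %d distinct lines: %d\n",
    scalar @unique, count_comparisons(@unique);
```

The exact counts depend on the Perl version, but the second number should be far smaller, which matches the situation described in the footnote: most of the input was duplicates.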

* This file contained a large amount of redundancy: the resulting output was only 1MB (about 0.003 of the original size).
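
The ratio in the footnote can be checked with a few lines of Perl. The sketch below is only an illustration: unique_fraction is a hypothetical helper, and the four sample lines are made up. It reports what fraction of the input bytes would survive once duplicate lines are dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fraction of the input bytes that remain after
# duplicate lines are removed (0 for empty input).
sub unique_fraction {
    my @lines = @_;
    my %seen;
    my ($total_bytes, $unique_bytes) = (0, 0);
    for my $line (@lines) {
        $total_bytes += length($line);
        # Count a line's bytes only the first time it is seen.
        $unique_bytes += length($line) unless $seen{$line}++;
    }
    return $total_bytes ? $unique_bytes / $total_bytes : 0;
}

# The figures quoted above: roughly 1MB of output from 326MB of input.
printf "%.3f\n", 1 / 326;    # prints 0.003

# Made-up sample: 4 lines of equal length, 2 of them distinct.
printf "%.2f\n", unique_fraction("a|1\n", "a|1\n", "b|2\n", "a|1\n");    # prints 0.50
```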
