Allmusic ID3 Tag Fixer
Introduction:
One of the major problems with using Allmusic.com as a metadata source is the difficulty in searching for the correct ablbum. Ideally we want to be searching for both Artist and Album together. However Allmusic.com only allows for searching of either album or artist not both. So the best way I have determined to match the album "Greatest Hits" is to first search for the artist and then search the discography for that album.
Part 2: Downloading Artist ID from Allmusic
Allmusic.com uses individual sql IDs to identify each artist and album. So what we need is a script to search and parse out the correct sql ID and add it to our mySQL database. I decided to use perl over php because this script will be run from a command prompt and not a browser because of the lenght of time necessary to complete it’s task.
?
It is pretty simple to use libwww to download an html page. In order to get the most accurate results we post our data to Allmusic.com. Then we need to parse out the correct information from the downloaded page. The main problem I ran into was how to make sure that we had accurately picked the right artist. I discovered a very simple perl module named string::compare. As I have said it is very basic and returns a value from 0 to 1, 1 being an identical string. After we parse the info we just need to add it to our mySQL database.
Now there is a tricky part with Allmusic.com. In order to prevent commercial websites from overwhelming the Allmusic.com?website they impose a ban if you request more pages in a given period of time then what they are comfortable with. In order to prevent this from happening I had to add a delay to the script which causes it to take nearly three times as long. I ran through a list of about 700 artists in about an hour.
The script:
#!/usr/bin/perl
use HTTP::Request::Common qw(POST);
use LWP::UserAgent;
#use our little string comparision module
use String::Compare;
#These are the same mySQL functions again
sub Create_DB_Connection{
?? use DBI;
?? $DSN? = "DBI:mysql:database=DATABASE;host=HOST";
?? $user = "USERNAME";
?? $pw?? = "PASSWORD";
?? $dbh? = DBI->connect($DSN,$user,$pw)
???? || die "Cannot connect: $DBI::errstr\n" unless $dbh;
?? return;
?} # End of Create_DB_Connection subroutine.
? sub Do_SQL{
??? eval{
????? $sth = $dbh->prepare($SQL);
??? }; # End of eval
??? # Check for errors.
??? if($@){
????? $dbh->disconnect;
????? print "Content-type: text/html\n\n";
????? print "An ERROR occurred! $@\n";
????? exit;
? } else {
??? $sth->execute;
? } # End of if..else
? return ($sth);
} # End of Do_SQL subroutine
?sub filter{
?? $_[0]=~s/\'/\'\'/g;
?? $_[0]=~s/\\/\\\\/g;
?? return $_[0];
?} # End of filter subroutine
&Create_DB_Connection;
#We want distinct artists that we have no comparision
#value for or its comparision value is less than 65%
#This enables us to run the script multiple times
#without requesting an artist twice that we already
#know is correct.
$SQL = "SELECT DISTINCT artist FROM albums WHERE ";
$SQL = $SQL . "artistcomp < '.65' OR artistcomp IS NULL";
&Do_SQL;
$i=0;
while ($pointer = $sth->fetchrow_hashref){
? $sqlArtist = $pointer->{'artist'};
? push(@db, $sqlArtist);
}
foreach $sqlArtist (@db){
$i++;
###Search URL TO Use
$url = "http://www.allmusic.com/cg/amg.dll";
##hehe The script runs a little faster than all
##music would like it looks like the have a max
##of 50 searches in a given period of time
##to prevent you from being banned we will sleep
##for 2 minutes for every 25 search requests.
if ($i == 25){
sleep 120;
$i = 0;
}
##Get Web Page
$ua = new LWP::UserAgent;
$req = POST $url, [ P => 'amg', OPT1 => '1', SQL => $sqlArtist];
$total = $ua->request($req)->as_string;
##Split out all tabs and end of lines
$total =~ tr/\n\r\t / /s;
$locate = index($total,'width:190px;word-wrap:break-word;',0)+ 33;
if ($locate < 33){
? ##We are not on a list page or there was an error
? ##NEED TO ADD CHECK TO SEE IF WE HAVE BEEN TAKEN TO THE ARTIST PAGE
? $locate = index($total,'Discography',0) - 40;
? if ($locate < 0){
??? ##Wow I really have no idea were we are must be an error page
##or the artist does not exist, lets skip it and try another
??? next;
? }
? $locate = index($total,'sql=',$locate) + 4;
? $end = index($total, '~', $locate);
? $where = $end - $locate;
? $artistid = substr($total,($locate),$where);
?
? $locate = index($total,'class="title">',0) + 14;
? $end = index($total, '<', $locate);
? $where = $end - $locate;
? $artist = substr($total,($locate),$where);
}
else {
? ##We are on the list page.
? $locate = index($total,'sql=',$locate) + 4;
? $end = index($total, '"', $locate);
? $where = $end - $locate;
? $artistid = substr($total,($locate),$where);
?
? $locate = index($total,'>',$locate) + 1;
? $end = index($total, '<', $locate);
? $where = $end - $locate;
? $artist = substr($total,($locate),$where);
}
#Escape characters that would cause SQL errors
$artistcomp = compare($artist, $sqlArtist);
$artist = filter($artist);
$artistid = filter($artistid);
$sqlArtist = filter($sqlArtist);
$SQL = "UPDATE albums SET artistid = '$artistid', newartist = '$artist',";
$SQL = $SQL . " artistcomp = '$artistcomp' WHERE artist = '$sqlArtist'";
&Do_SQL;
print $i . "\n";
}
$dbh->disconnect;
Notes:
I would leave the sleep function alone. If you alter it too much you will get temporarily or possibly permanetly.
As I have stated before this is a very rudimentary script. I have only designed it to run from the command line. I imagine that you could make this much more user friendly or design it into a website. If you did this I would recommend only 20-25 artists at a time.
Next:
After we have identified all of the artists we need to make sure that we have identified them correctly. The easiest way to do this is partially using the comparision value. I have found that anything above 65% has only 1 error in about 200. So next I will show you a simple PHP script to look at and correct any errors that were missed or created in the parsing process.