perl LWP::Simple

serpretetsky

I'm trying to use Perl and LWP::Simple to make a small script that gets the HTML for a Google search. Currently the "returnHtml" subroutine just prints out the HTML and the URL it used to fetch it.

The problem is that LWP::Simple only works when I give it just the domain name, not the whole URL. Is this the way it's supposed to work, or did I break it somehow?

For example, with the current huge mess of a URL in there, LWP::Simple's "get" gives me nothing, just blank. Edit: it doesn't trip my " != defined" error catch either.

However, if I replace that giant URL with something like "http://www.google.com" or "http://www.yahoo.com" it prints out the HTML code right away.

What am I doing wrong?
Thank you for any help.

Code:
#!/usr/bin/perl
use strict;
use LWP::Simple;

my $urlSearch = convertSrchToUrl($ARGV[0]);
my $htmlUnfiltered = returnHtml($urlSearch);

sub returnHtml {
  my $content = get $_[0];
  if ($content != defined){
    die "Could not receive html data for given url: $urlSearch";
  }
  print $_[0];
  print $content;
  return($content);
}

sub convertSrchToUrl {
  my $url = "http://www.google.com/#hl=en&output=search&sclient=psy-ab&q=" . $_[0] . "&pbx=1&oq=" . $_[0] . "&aq=f&aqi=g-m1&aql=&gs_sm=3&gs_upl=709l3851l0l4189l6l6l0l0l0l0l99l478l6l6l0&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=569c85e601367506&biw=665&bih=607";
  return($url);
}
 
Found my problem. First of all, my error catch wasn't working anyway (I could have sworn I tested it, I guess not).

Second of all, the LWP::Simple module sends HTTP requests whose User-Agent header (in other words, the browser type) is "libwww-perl/#.###".

Google sees that a Perl script is trying to access their site and blocks it automatically.
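
On the error catch: as far as I can tell, `$content != defined` compares `$content` numerically against `defined($_)`. An HTML string and a false value both count as 0 in a numeric comparison, so the test is never true. Below is a minimal sketch of a check that does fire. The helper name `check_content` and the example.com URL are made up for illustration, and the `undef` passed in stands in for a failed `get`, which returns undef.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: die unless the fetched content is defined.
sub check_content {
    my ($content, $url) = @_;
    die "Could not receive HTML data for given URL: $url\n"
        unless defined $content;
    return $content;
}

# Simulate a failed fetch (LWP::Simple's get returns undef on failure).
my $ok = eval { check_content(undef, 'http://www.example.com/'); 1 };
print "caught: $@" unless $ok;

# Simulate a successful fetch; the content is passed through unchanged.
print check_content('<html></html>', 'http://www.example.com/'), "\n";
```

In the original script this would amount to writing `unless (defined $content)` in place of `if ($content != defined)`.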

edit:
solution
Code:
#!/usr/bin/perl
use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(20);
# Browser-like User-Agent; no trailing "\r\n", since LWP adds the header line ending itself.
$ua->agent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11");

my $response = $ua->get($ARGV[0]);
die "Can't get $ARGV[0] -- ", $response->status_line unless $response->is_success;

die "Expected HTML, not ", $response->content_type unless $response->content_type eq 'text/html';

print $response->content;
 
I'm guessing you're using Perl because you plan on processing the HTML later, but why not just use wget to fetch the page and pipe it into your Perl script? That would be much easier and would make use of the tools at hand. Unless you're running Perl on Windows, which would be very confusing to me.

Also, I think it may be against Google's TOS to go around their search API without a proper license, and too many queries may get you blocked from using the site.
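
A rough sketch of the wget-and-pipe approach, assuming the page arrives on standard input. The script name process.pl, the helper `extract_title`, and the title extraction itself are placeholders for whatever processing is actually needed. A regex like this is only good enough for a quick look, not for real HTML parsing.

```perl
#!/usr/bin/perl
# Hypothetical usage:
#   wget -q -O - "http://www.example.com/" | perl process.pl
use strict;
use warnings;

# Placeholder processing step: return the text of the <title> element,
# or undef if the page has none.
sub extract_title {
    my ($html) = @_;
    return $html =~ m{<title[^>]*>(.*?)</title>}is ? $1 : undef;
}

if (-t STDIN) {
    # Nothing was piped in; avoid blocking on terminal input.
    warn "Usage: wget -q -O - URL | perl $0\n";
}
else {
    # Slurp the whole page from standard input in one read.
    my $html = do { local $/; <STDIN> };
    if (defined $html && length $html) {
        my $title = extract_title($html);
        print defined $title ? "Title: $title\n" : "No <title> found\n";
    }
    else {
        warn "No input received on STDIN\n";
    }
}
```

This way wget handles the fetching, and the Perl side only has to deal with the text it is given.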
 
It's for a Perl homework problem. I see Google also has an API that allows 100 requests or fewer per day, so I will probably look into that.
 