perl LWP::Simple

serpretetsky · Feb 28, 2012

I'm trying to use Perl and LWP::Simple to make a small script that gets the HTML code for a Google search. Currently the "returnHtml" subroutine just prints out the code and the URL it used to get it.

The problem is that LWP::Simple only works with just the domain name, not the whole URL. Is this the way it's supposed to work, or did I break it somehow?

For example, with the current huge mess of a URL in there, LWP::Simple's "get" gives me nothing, just blank output. Edit: it doesn't trip my "!= defined" error catch either.

However, if I replace that giant URL with something like "http://www.google.com" or "http://www.yahoo.com" it prints out the HTML code right away.

What am I doing wrong?
Thank you for any help.

Code:

#!/usr/bin/perl
use strict;
use LWP::Simple;

my $urlSearch = convertSrchToUrl($ARGV[0]);
my $htmlUnfiltered = returnHtml($urlSearch);

sub returnHtml {
  my $content = get $_[0];
  if ($content != defined){
    die "Could not recieve html data for given url: $urlSearch";
  }
  print $_[0];
  print $content;
  return($content);
}

sub convertSrchToUrl {
  my $url = "http://www.google.com/#hl=en&output=search&sclient=psy-ab&q=" . $_[0] . "&pbx=1&oq=" . $_[0] . "&aq=f&aqi=g-m1&aql=&gs_sm=3&gs_upl=709l3851l0l4189l6l6l0l0l0l0l99l478l6l6l0&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=569c85e601367506&biw=665&bih=607";
  return($url);
}

serpretetsky · Mar 2, 2012

Found my problem. First of all, my error catch wasn't working anyway (I could have sworn I tested it, I guess not).
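As far as I can tell, the check never fires because "!=" is a numeric comparison and a bare "defined" tests $_ rather than $content, so both sides end up as 0. A minimal sketch of a check that does work, relying on the fact that LWP::Simple's get returns undef on failure (has_content is a made-up helper name, and the fetch results are simulated so nothing touches the network):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true only if the fetch returned something.
# LWP::Simple's get() returns undef when the request fails.
sub has_content {
    my ($content) = @_;
    return defined $content;
}

# Simulated results: a failed fetch (undef) and a successful one.
my $failed  = undef;
my $fetched = "<html><body>hello</body></html>";

print has_content($failed)  ? "got content\n" : "no content\n";
print has_content($fetched) ? "got content\n" : "no content\n";
```

In the original subroutine the same idea would read: die "..." unless defined $content;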

Second of all, the LWP::Simple module sends HTTP requests whose User-Agent header (in other words, the browser type) contains "libwww-perl/#.###".

Google sees that you are trying to use a Perl script to access their site and blocks you automatically.
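To see what the library calls itself without making any request, something like the following should work. It is only a sketch: "HomeworkFetcher/0.1" is a made-up name for illustration, and the exact version number printed depends on the installed libwww-perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# A fresh user agent object identifies itself as libwww-perl/<version>.
my $ua = LWP::UserAgent->new;
my $default_agent = $ua->agent;
print "Default: $default_agent\n";

# "HomeworkFetcher/0.1" is a hypothetical name, not anything Google expects.
# When the string ends in whitespace, LWP::UserAgent appends its default
# libwww-perl/<version> identifier to it.
$ua->agent("HomeworkFetcher/0.1 ");
my $custom_agent = $ua->agent;
print "Custom:  $custom_agent\n";
```

The second call matters in practice: any agent string that ends in whitespace still carries the libwww-perl identifier along with it.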

edit:
solution

Code:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent with a 20-second timeout and a browser-like User-Agent string.
my $ua = LWP::UserAgent->new;
$ua->timeout(20);
$ua->agent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11");

# Fetch the URL given as the first command-line argument.
my $response = $ua->get($ARGV[0]);
die "Can't get $ARGV[0] -- ", $response->status_line unless $response->is_success;

# Only continue if the server actually returned HTML.
die "Expected HTML, not ", $response->content_type unless $response->content_type eq 'text/html';

print $response->content;

JosiahBradley · Mar 4, 2012

I am guessing you are using Perl because you plan on processing the HTML later, but why not just use wget to fetch the page and pipe it into your Perl script? That would be much easier and would make use of the tools at hand. Unless you are running Perl on Windows, which would be very confusing to me.
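A rough sketch of what the Perl side of that pipe could look like. The script name title.pl and the helper extract_title are made up for illustration, and the regex is only good enough for a quick check, not a substitute for a real HTML parser:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the <title> text out of an HTML string.
# Returns undef when no title element is found.
sub extract_title {
    my ($html) = @_;
    return $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is ? $1 : undef;
}

# Only read STDIN when something is piped in, for example:
#   wget -qO- "http://www.example.com/" | perl title.pl
unless (-t STDIN) {
    my $html  = do { local $/; <STDIN> };   # slurp all piped input at once
    my $title = extract_title($html // '');
    print defined $title ? "Title: $title\n" : "No <title> found\n";
}
```

Here wget does the fetching (-q keeps it quiet, -O- writes the page to standard output) and the script only has to deal with text.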

Also, I think it may be against Google's TOS to go around their search API without a proper license, and too many queries may get you blocked from using the site.

serpretetsky · Mar 4, 2012

It's for a Perl homework problem. I see Google also has an API for 100 requests or fewer per day; I will probably look into that.
