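As we all know, curl can send requests that mimic a browser and fetch the source of a given page, and many PHP developers use this feature of the curl extension to scrape data. Scraping usually targets a large number of pages, though, and processing them one at a time in a loop is very slow. Fetching several pages at the same time can cut the total time considerably.
An ordinary single-threaded curl request creates a handle with curl_init and then runs it with curl_exec to get the response we want. "Multi-threaded" requests are made with the curl_multi family of functions instead. (Strictly speaking, curl_multi does not start threads; it drives several transfers concurrently inside a single PHP process. This article keeps the common "multi-threaded" wording.)
Below, we use a concrete example to implement multi-threaded requests with curl.
The example scrapes Baidu search results:
Keyword: "搜索优化" (search optimization)
Number of URLs: the first 50 pages of search results
Scraping target: the title of each organic search result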
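The task definition above can be turned into a list of request URLs before any request is made. The following is only a sketch: the helper name buildSearchUrls is hypothetical, and it assumes (as the scripts in this article do) that the pn parameter advances by 10 per result page.

```php
<?php
// Hypothetical helper, not part of the original scripts:
// build one URL per result page for a given keyword.
function buildSearchUrls(string $keyword, int $pages): array
{
    $urls = [];
    for ($i = 0; $i < $pages; $i++) {
        // urlencode() makes the non-ASCII keyword safe to put in a query string
        $url = 'https://www.baidu.com/s?wd=' . urlencode($keyword);
        if ($i > 0) {
            // Assumption: pn is the result offset, 10 results per page
            $url .= '&pn=' . (10 * $i);
        }
        $urls[] = $url;
    }
    return $urls;
}

$urls = buildSearchUrls('搜索优化', 50);
echo count($urls), " URLs built\n";
```

Separating URL construction from fetching keeps the single-threaded and multi-threaded versions comparable, since both work on the same list.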
1. Single-threaded data scraping with curl
<?php
$start = time();
// Request headers imitating a browser visit to Baidu
$header = [
    'Content-Type: text/html;charset=utf-8',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language: zh-CN,zh;q=0.9',
    'Cache-Control: no-cache',
    'Connection: keep-alive',
    'Host: www.baidu.com',
    'Pragma: no-cache',
    'Upgrade-Insecure-Requests: 1',
    'Referer: https://www.baidu.com/',
    'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3239.132 Safari/537.36',
    'X-Requested-With: XMLHttpRequest'
];
$keyword = urlencode('搜索优化'); // URL-encode the non-ASCII keyword
$contents = [];
for ($i = 0; $i < 50; $i++) {
    if ($i > 0) {
        $num = 10 * $i;
        $url = "https://www.baidu.com/s?wd=" . $keyword . "&pn=" . $num;
    } else {
        $url = "https://www.baidu.com/s?wd=" . $keyword;
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    // Certificate verification is switched off here for the demo only
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    $contents[] = curl_exec($ch); // blocks until this page has been fetched
    curl_close($ch);
}
$titleRes = [];
foreach ($contents as $content) {
    // curl_exec returns false on failure, so cast to string first
    $titleRes[] = getTitle((string) $content);
}
var_dump($titleRes);
$end = time();
echo "<hr>";
echo "Execution time: " . ($end - $start) . ' s';

// Extract the search result titles
function getTitle($content)
{
    preg_match_all('/<h3([^<>]*)>(.*)<\/h3>/Uis', $content, $arr);
    $title = [];
    if (!empty($arr[2])) {
        foreach ($arr[2] as $value) {
            $title[] = strip_tags($value);
        }
    }
    return $title;
}
[Screenshot: program run result]
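To make the title extraction step easier to follow, here is a small offline sketch that applies the same regular expression to a hand-written HTML fragment. The fragment and the function name extractTitles are made up for illustration; real Baidu markup may differ.

```php
<?php
// Same pattern as getTitle(): capture everything between <h3 ...> and </h3>.
// U = ungreedy, i = case-insensitive, s = dot also matches newlines.
function extractTitles(string $content): array
{
    preg_match_all('/<h3([^<>]*)>(.*)<\/h3>/Uis', $content, $arr);
    $titles = [];
    foreach ($arr[2] as $value) {
        // Drop the inner <a>/<em> tags and keep only the text
        $titles[] = strip_tags($value);
    }
    return $titles;
}

// Hypothetical sample markup, not copied from a real result page
$html = '<div><h3 class="t"><a href="#">First <em>result</em></a></h3>'
      . '<p>snippet</p><h3 class="t"><a>Second</a></h3></div>';

echo implode(' | ', extractTitles($html)), "\n"; // prints "First result | Second"
```

A regular expression is enough for this demonstration, but it depends on the titles staying inside h3 elements.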

2. Multi-threaded data scraping with curl
<?php
$start = time();
// Request headers for the Baidu search
$header = [
    'Content-Type: text/html;charset=utf-8',
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language: zh-CN,zh;q=0.9',
    'Cache-Control: no-cache',
    'Connection: keep-alive',
    'Host: www.baidu.com',
    'Pragma: no-cache',
    'Upgrade-Insecure-Requests: 1',
    'Referer: https://www.baidu.com/',
    'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3239.132 Safari/537.36',
    'X-Requested-With: XMLHttpRequest'
];
$keyword = urlencode('搜索优化'); // URL-encode the non-ASCII keyword
$chs = [];
for ($i = 0; $i < 50; $i++) {
    if ($i > 0) {
        $num = 10 * $i;
        $url = "https://www.baidu.com/s?wd=" . $keyword . "&pn=" . $num;
    } else {
        $url = "https://www.baidu.com/s?wd=" . $keyword;
    }
    // Configure one handle per page, but do not execute it yet
    $chs[$i] = curl_init();
    curl_setopt($chs[$i], CURLOPT_URL, $url);
    curl_setopt($chs[$i], CURLOPT_HTTPHEADER, $header);
    curl_setopt($chs[$i], CURLOPT_RETURNTRANSFER, 1);
    // Certificate verification is switched off here for the demo only
    curl_setopt($chs[$i], CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($chs[$i], CURLOPT_SSL_VERIFYHOST, 0);
}
// The key part of multi-threaded processing
$mh = curl_multi_init(); // create the cURL multi handle
foreach ($chs as $ch) {
    curl_multi_add_handle($mh, $ch); // add each curl handle to the multi handle
}
$running = null;
// Run the multi handle until no transfer is still active
do {
    usleep(10000); // pause 10 ms between polls
    curl_multi_exec($mh, $running); // advance all transfers; $running = transfers still active
} while ($running > 0);
$contents = [];
foreach ($chs as $k => $ch) {
    $contents[$k] = curl_multi_getcontent($ch); // get the response body of this handle
    curl_multi_remove_handle($mh, $ch); // remove the handle from $mh
}
curl_multi_close($mh); // close the multi handle
$titleRes = [];
foreach ($contents as $content) {
    $titleRes[] = getTitle((string) $content);
}
var_dump($titleRes);
$end = time();
echo "<hr>";
echo "Execution time: " . ($end - $start) . ' s';

// Extract the search result titles
function getTitle($content)
{
    preg_match_all('/<h3([^<>]*)>(.*)<\/h3>/Uis', $content, $arr);
    $title = [];
    if (!empty($arr[2])) {
        foreach ($arr[2] as $value) {
            $title[] = strip_tags($value);
        }
    }
    return $title;
}
[Screenshot: program run result]

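Processing the 50 pages single-threaded took 17 s, while the same 50 pages took 6 s with curl_multi. That is about 65% less time, a large gain in scraping efficiency. Still, 6 s for 50 pages with multi-threaded curl is not especially fast, and curl_multi leaves plenty of room for further optimization.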
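One possible direction for such optimization is to replace the fixed usleep polling with curl_multi_select, which waits until one of the transfers has activity. The following is a sketch only, not a measured improvement: the function name fetchAll, the 1-second select timeout, and the CURLOPT_TIMEOUT value are all assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: fetch every URL concurrently and return the
// response bodies under the same array keys as $urls.
function fetchAll(array $urls, array $header = [], int $timeout = 10): array
{
    $mh = curl_multi_init();
    $chs = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout); // assumed per-request limit
        curl_multi_add_handle($mh, $ch);
        $chs[$key] = $ch;
    }

    $running = 0;
    do {
        $status = curl_multi_exec($mh, $running);
        if ($status > CURLM_OK) {
            break; // a positive status code is a multi-handle error
        }
        if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
            // select failed; back off briefly instead of spinning
            usleep(1000);
        }
    } while ($running > 0);

    $contents = [];
    foreach ($chs as $key => $ch) {
        $contents[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);
    return $contents;
}
```

Calling fetchAll($urls, $header) with the 50 search URLs would return the page bodies in the same order as the input, ready to pass to getTitle. Whether this is faster than the 10 ms polling loop depends on the network and the server, so it should be timed before drawing conclusions.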
