Spider 包讲解 | Pholcus 完全手册

3 采集规则

3.1 Spider结构体

Spider结构体用于定义蜘蛛规则。

3.1.1 一条简单的蜘蛛规则

func init() {
    BaiduSearch.AddMenu()
}

var BaiduSearch = &Spider{
    Name:        "百度搜索",
    Description: "百度搜索结果 [www.baidu.com]",
    // Pausetime: [2]uint{uint(3000), uint(1000)},
    Keyword:   CAN_ADD,
    UseCookie: false,
    RuleTree: &RuleTree{
        Root: func(self *Spider) {
            self.Aid("生成请求", map[string]interface{}{"loop": [2]int{0, 1}, "Rule": "生成请求"})
        },

        Trunk: map[string]*Rule{

            "生成请求": {
                AidFunc: func(self *Spider, aid map[string]interface{}) interface{} {
                    for loop := aid["loop"].([2]int); loop[0] < loop[1]; loop[0]++ {
                        self.AddQueue(map[string]interface{}{
                            "Url":  "http://www.baidu.com/s?ie=utf-8&nojc=1&wd=" + self.GetKeyword() + "&rn=50&pn=" + strconv.Itoa(50*loop[0]),
                            "Rule": aid["Rule"],
                        })
                    }
                    return nil
                },
                ParseFunc: func(self *Spider, resp *context.Response) {
                    query := resp.GetDom()
                    total1 := query.Find(".nums").Text()
                    re, _ := regexp.Compile(`[\D]*`)
                    total1 = re.ReplaceAllString(total1, "")
                    total2, _ := strconv.Atoi(total1)
                    total := int(math.Ceil(float64(total2) / 50))
                    if total > self.MaxPage {
                        total = self.MaxPage
                    } else if total == 0 {
                        Log.Printf("[消息提示：| 任务：%v | 关键词：%v | 规则：%v] 没有抓取到任何数据！!!\n", self.GetName(), self.GetKeyword(), resp.GetRuleName())
                        return
                    }
                    // 调用指定规则下辅助函数
                    self.Aid("生成请求", map[string]interface{}{"loop": [2]int{1, total}, "Rule": "搜索结果"})
                    // 用指定规则解析响应流
                    self.Parse("搜索结果", resp)
                },
            },

            "搜索结果": {
                //注意：有无字段语义和是否输出数据必须保持一致
                OutFeild: []string{
                    "标题",
                    "内容",
                    "不完整URL",
                    "百度跳转",
                },
                ParseFunc: func(self *Spider, resp *context.Response) {
                    query := resp.GetDom()
                    query.Find("#content_left .c-container").Each(func(i int, s *goquery.Selection) {
                        title := s.Find(".t").Text()
                        content := s.Find(".c-abstract").Text()
                        href, _ := s.Find(".t >a").Attr("href")
                        tar := s.Find(".g").Text()

                        re, _ := regexp.Compile("\\<[\\S\\s]+?\\>")
                        // title = re.ReplaceAllStringFunc(title, strings.ToLower)
                        // content = re.ReplaceAllStringFunc(content, strings.ToLower)

                        title = re.ReplaceAllString(title, "")
                        content = re.ReplaceAllString(content, "")

                        // 结果存入Response中转
                        resp.AddItem(map[string]interface{}{
                            self.OutFeild(resp, 0): strings.Trim(title, " \t\n"),
                            self.OutFeild(resp, 1): strings.Trim(content, " \t\n"),
                            self.OutFeild(resp, 2): tar,
                            self.OutFeild(resp, 3): href,
                        })
                    })
                },
            },
        },
    },
}

3.1.2 Spider成员属性

Spider.Id
编写规则时无需指定，为SpiderQueue自动设置；
Spider.Name
规则名称，必须设置，且须保证全局唯一，在Task任务分发及规则列表中调用；
Spider.Description
规则描述，将在规则列表中显示；
Spider.Pausetime
设置暂停时间Pausetime[0]~Pausetime[0]+Pausetime[1]，一般在UI统一设置，无需在规则中设置；
Spider.MaxPage
在UI设置，规则中涉及采集页数控制时，调用该属性值；
Spider.Keyword
UI中自定义输入，如需使用该属性，则必须在规则中初始化该值为常量USE；
Spider.UseCookie
控制下载器的两种运行模式，为true时支持登录功能，为false时支持大量UserAgent随机轮换，需在规则中指定；
Spider.Proxy
代理服务器；
Spider.RuleTree
采集规则的核心部分，包含所有规则解析函数；
Spider.RuleTree.Root
规则树的树根，即采集规则的执行入口函数；
Spider.RuleTree.Trunk
规则树树干，map[string]*Rule类型，包含除入口函数外，所有解析规则；
Spider.RuleTree.Trunk["name"]Rule
就像是树干上的树枝一样，它是规则树上的一条解析规则；
Rule.OutFeild
输出字段，在非数据库的输出方式（Excel/csv）中作为标题行，规则中提交结果时，字段名应依赖改字段，该条Rule有OutFeild与该条规则有无结果输出必须保持一致；
Rule.ParseFunc
下载内容的解析函数，有系统根据response指定Rule自动调用；
Rule.AidFunc
辅助函数，根据需要你可以自由定制各种处理方法，例如最常用的便是批量生成请求。

3.1.3 Spider中常用方法讲解

func (self *Spider) AddMenu()
将蜘蛛规则自身添加至蜘蛛规则列表；
func (self Spider) AddOutFeild(respOrRuleName interface{}, feild string)
动态追加OutFeild切片成员，速度不如静态字段快 respOrRuleName接受Response或string两种类型，为*Response类型时指定当前Rule；
func (self *Spider) AddQueue(param map[string]interface{}) 向调度器添加请求；
func (self *Spider) BulkAddQueue(urls []string, param map[string]interface{})
批量添加同一类请求，要求除url不同外，其他请求参数相同；
func (self Spider) Parse(ruleName string, resp context.Response)
调用指定Rule下的ParseFunc方法；
func (self *Spider) Aid(ruleName string, aid map[string]interface{}) interface{}
调用指定Rule下的AidFunc方法；
func (self Spider) OutFeild(respOrRuleName interface{}, index int) string
获取指定Rule的OutFeild，respOrRuleName接受Response或string两种类型，为*Response类型时指定当前Rule；
func (self *Spider) GetKeyword() string
获取UI输入的自定义字符串；
func (self *Spider) GetMaxPage() int
获取UI输入的最大采集页数限制；
func (self Spider) GetRules() map[string]Rule
获取整个规则树；
func (self *Spider) SetPausetime(pause [2]uint, runtime ...bool)
设置暂停时间 pause[0]~(pause[0]+pause[1]) 当且仅当runtime[0]为true时可覆盖现有值。