the_scrap, a Ruby web page scraper based on Nokogiri

A Ruby web page scraper based on Nokogiri.
  • Source name: the_scrap
  • Source URL: http://www.github.com/hjleochen/the_scrap
  • Git URL:
    git://www.github.com/hjleochen/the_scrap.git
    Clone the code locally:
    git clone http://www.github.com/hjleochen/the_scrap

    TheScrap

    Why

    The most basic workflow for scraping web page data is:

    • Determine the start URL to scrape, e.g. https://ruby-china.org/topics
    • Scrape the list. List data is usually laid out as tr, li, div, dd, etc., with each node being one record; for the URL above, the CSS selector for a record would be ".topic".

    While handling these tasks you often run into problems such as:

    • The extracted data needs special processing. Again with the ruby-china example, take a post's read count: the content under ".info .leader" is "· 618 次阅读" ("618 reads"), while all you actually need is 618 (see the hand-rolled sketch below).

    (Scrubyt was probably the nicest of the scrapers I had used.) So, driven by actual needs, the current approach gradually took shape:
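
    For context, here is a minimal hand-rolled sketch of that workflow in plain Nokogiri; the selectors (".topic", ".title", ".info .leader") are illustrative assumptions, not the actual ruby-china markup:

    require 'open-uri'
    require 'nokogiri'

    # fetch the start URL and parse it
    doc = Nokogiri::HTML(URI.open("https://ruby-china.org/topics"))

    # each matched node is one record
    doc.css(".topic").each do |node|
      title = node.css(".title").inner_text.strip
      # special processing: keep only the number from text like "· 618 次阅读"
      reads = node.css(".info .leader").inner_text[/\d+/].to_i
      puts "#{title}: #{reads}"
    end

    the_scrap wraps exactly these steps (fetch, split into fragments, extract attributes, post-process) behind the declarative interface below.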

    Installation

    Add this line to your application's Gemfile:

    gem 'the_scrap'

    And then execute:

    $ bundle

    Or install it yourself as:

    $ gem install the_scrap

    Usage

    # encoding: utf-8
    require 'rubygems'
    require 'the_scrap'
    require 'pp'

    # create object
    scrap = TheScrap::ListObj.new

    # set start url
    scrap.url = "http://fz.ganji.com/shouji/"

    # fragment css selector
    # Matches each table row or list element; this row/element should contain
    # the full record, whose details are extracted via the attr list below.
    scrap.item_frag = ".layoutlist .list-bigpic"

    # scrap attr list
    scrap.attr_name       = ['.ft-tit', :inner_html]
    scrap.attr_detail_url = ['.ft-tit', :href]
    scrap.attr_img        = ['dt a img', :src]
    scrap.attr_desc       = '.feature p'
    scrap.attr_price      = '.fc-org'

    # debug
    scrap.debug = true
    scrap.verbose = true

    # html preprocess
    scrap.html_proc << lambda { |html|
      #html.gsub(/abcd/,'efgh')
      html
    }

    # filter scraped items
    scrap.item_filters << lambda { |item_info|
      return false if item_info['name'].nil? || item_info['name'].length == 0
      return true
    }

    # data process
    scrap.data_proc << lambda { |url, i|
      i['name'] = i['name'].strip
    }

    # result process
    scrap.result_proc << lambda { |url, items|
      items.each do |item|
        pp item
      end
    }

    ##### multi-page (pagination) scraping can be added here, see section 2
    ##### detail-page scraping can be added here, see section 3

    # scrap
    scrap.scrap_list

    # create ListObj
    # ...

    ########### has many pages ###########
    # If set, multiple list pages can be scraped, using one of several
    # pagination styles: [:next_page, :total_pages, :total_records]
    scrap.has_many_pages = true

    # :next_page -- follow the "next page" link
    scrap.pager_method = :next_page
    scrap.next_page_css = ".next_page a"

    # :total_pages -- compute page URLs from the total page count
    scrap.pager_method = :total_pages
    scrap.get_page_count = lambda { |doc|
      if doc.css('.total_page').text =~ /(\d+)页/
        $~[1].to_i
      else
        0
      end
    }
    scrap.get_next_url = lambda { |url, next_page_number|
      # url is http://fz.ganji.com/shouji/
      # page url pattern: http://fz.ganji.com/shouji/o#{page_number}/
      url + "o#{next_page_number}/"
    }

    # :total_records -- still in progress
    scrap.pager_method = :total_records

    # ...
    scrap.scrap_list

    If the DetailObj is run inside a ListObj rather than on its own, the scraped info is merged into the ListObj's results.

    # create ListObj
    # extract detail page url
    scrap.attr_detail_url = [".list a", :href]
    # ...

    ################# has detail page #################
    # If set, detail page info is fetched via the detail page URLs scraped above.

    # 1. define a detail object
    scrap_detail = TheScrap::DetailObj.new
    scrap_detail.attr_title   = ".Tbox h3"
    scrap_detail.attr_detail  = ".Tbox .newsatr"
    scrap_detail.attr_content = [".Tbox .view", :inner_html]

    # optional html preprocess
    scrap_detail.html_proc << lambda { |response|
    }

    # optional data process
    scrap_detail.data_proc << lambda { |url, i|
    }

    # optional result process
    # Optional: the scraped info is merged into the record from the list
    # page, or it can be persisted separately here.
    scrap_detail.result_proc << lambda { |url, items|
    }

    # take the url from the list attr and extract data via scrap_detail
    scrap.detail_info << [scrap_detail, 'detail_url']
    #scrap.detail_info << [scrap_detail_1, 'detail_url_1']
    # ...
    scrap.scrap_list

    After scraping, each record is put into a hash: the attribute name (the part after attr_) is the hash key and the extracted data is the value. For example:

    scrap.attr_name = ".title"

    Then the result is item['name'] = (the content of the node matched by ".title").
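
    A minimal illustration, with hypothetical markup:

    # given a fragment like:
    #   <div class="item"><span class="title">Hello</span></div>
    # the setting
    scrap.attr_name = ".title"
    # yields a record where
    #   item['name'] #=> "Hello"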

    4.1 Using a CSS selector directly

    When a CSS selector is used directly, the text content (inner_text) of the matched node is extracted:

    @book_info.attr_author = "#divBookInfo .title a"

    4.2 Using an array

    scrap.attr_name = [css_selector, attrs]

    The first element of the array is the css_selector; the second tells the scraper what to extract from the matched node, and can be one of the following:

    :frag_attr

    Reads the named attribute from the fragment node itself; here the first element is :frag_attr instead of a CSS selector:

    scrap.attr_name = [:frag_attr, 'href']

    :inner_html

    Extracts the HTML inside the node (inner_html).

    :join

    When the matched node is a list and you need all of its elements joined with commas, e.g. tags:

    <ul class="tags">
      <li>ruby</li>
      <li>rails</li>
      <li>activerecord</li>
    </ul>

    scrap.attr_name = ['.tags', :join]

    The above yields the string:

    "ruby,rails,activerecord"

    :array

    When the matched node is a list and you need all of its elements returned as an Array:

    <ul class="tags">
      <li>ruby</li>
      <li>rails</li>
      <li>activerecord</li>
    </ul>

    scrap.attr_name = ['.tags', :array]

    The above yields the array:

    ['ruby','rails','activerecord']

    :src

    Extracts the image's src attribute and resolves it with URI.join(current_page_url, src_value).

    :href

    Extracts the link's href attribute and resolves it with URI.join(current_page_url, href_value), as shown in the example below.
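
    For reference, this is Ruby's standard URI.join, so relative src/href values resolve against the current page URL like this:

    require 'uri'

    URI.join('http://example.com/books/123', '/covers/1.jpg').to_s
    #=> "http://example.com/covers/1.jpg"
    URI.join('http://example.com/books/', 'cover.jpg').to_s
    #=> "http://example.com/books/cover.jpg"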

    "其它"

    Examples

    @book_info = TheScrap::DetailObj.new
    @book_info.attr_name            = "#divBookInfo .title h1"
    @book_info.attr_author          = "#divBookInfo .title a"
    @book_info.attr_desc            = [".intro .txt", :inner_html]
    @book_info.attr_pic_url         = ['.pic_box a img', :src]
    @book_info.attr_chapters_url    = ['.book_pic .opt li[1] a', :href]
    @book_info.attr_book_info       = ".info_box table tr"
    @book_info.attr_cat_1           = '.box_title .page_site a[2]'
    @book_info.attr_tags            = ['.book_info .other .labels .box[1] a', :array]
    @book_info.attr_user_tags       = ['.book_info .other .labels .box[2] a', :join]
    @book_info.attr_rate            = '#bzhjshu'
    @book_info.attr_rate_cnt        = ["#div_pingjiarenshu", 'title']
    @book_info.attr_last_updated_at = "#divBookInfo .tabs .right"
    @book_info.attr_last_chapter    = '.updata_cont .title a'
    @book_info.attr_last_chapter_desc = ['.updata_cont .cont a', :inner_html]

    And a data_proc example:

    baidu.data_proc << lambda { |url, i|
      i['title'] = i['title'].strip
      if i['ori_url'] =~ /view\.aspx\?id=(\d+)/
        i['ori_id'] = $~[1].to_i
      end
      if i['detail'] =~ /发布时间:(.*?)/
        i['updated_at'] = i['created_at'] = $~[1]
      end
      if i['detail'] =~ /来源:(.*?)作者:/
        i['description'] = $~[1].strip
      end
      i.delete('detail')
      i['content'].gsub!(/<script type="text\/javascript">.*?<\/script>/m, '')
      i['content'].gsub!(/<style>.*?<\/style>/m, '')
      i['content'].gsub!(/<img class="img_(sina|qq)_share".*?>/m, '')
      if i['content'] =~ /image=(.*?)"/
        #i['image'] = open($~[1]) if $~[1].length > 0
      end
      i['site_id'] = @site_id
      i['cat_id']  = @cat_id
      time = Time.parse(i['updated_at'])
      prep = '[' + time.strftime('%y%m%d') + ']'
    }

    MySQL

    require 'active_record'
    require 'mysql2'
    require 'activerecord-import' # recommended

    ActiveRecord::Base.establish_connection(
      :adapter  => "mysql2",
      :host     => "localhost",
      :database => "test",
      :username => "test",
      :password => ""
    )
    ActiveRecord::Base.record_timestamps = false

    class Article < ActiveRecord::Base
      validates :ori_id, :uniqueness => true
    end

    # OR load the Rails env!

    scrap.result_proc << lambda { |url, items|
      articles = []
      items.each do |item|
        #item[:user_id] = 1
        articles << Article.new(item)
      end
      Article.import articles
    }
    MongoDB

    require 'mongoid'
    Mongoid.load!("./mongoid.yml", :production)
    Mongoid.allow_dynamic_fields = true

    class Article
      include Mongoid::Document
      #....
    end

    # OR load the Rails env!

    scrap.result_proc << lambda { |url, items|
      items.each do |item|
        #item[:user_id] = 1
        Article.create(item)
      end
    }

    JSON, XML

    # json
    scrap.result_proc << lambda { |url, items|
      File.open("xxx.json", 'w').write(items.to_json)
    }

    # xml
    scrap.result_proc << lambda { |url, items|
      articles = []
      items.each do |item|
        articles << item.to_xml
      end
      file = File.open("xxx.xml", 'w')
      file.write('<articles>')
      file.write(articles.join(''))
      file.write('</articles>')
      file.close
    }

    TODO

