clickbait-detector: detects clickbait headlines using deep learning

Detects clickbait headlines using deep learning.
  • Source name: clickbait-detector
  • Source URL: http://www.github.com/saurabhmathur96/clickbait-detector
  • Git URL:
    git://www.github.com/saurabhmathur96/clickbait-detector.git
    Clone the code locally with Git:
    git clone http://www.github.com/saurabhmathur96/clickbait-detector
    Check out the code locally with Subversion:
    $ svn co --depth empty http://www.github.com/saurabhmathur96/clickbait-detector
    Checked out revision 1.
    $ cd clickbait-detector
    $ svn up trunk
    
Clickbait Detector

Detects clickbait headlines using deep learning.

A Chrome extension is also available (built by rahulkapoor90).

Requirements

  • Python 2.7.12
  • Keras 1.2.1
  • TensorFlow 0.12.1
  • NumPy 1.11.1
  • NLTK 3.2.1

Getting Started

Create a virtualenv in the project directory:

    virtualenv venv

Activate the virtualenv:

  • On Windows:

        cd venv/Scripts
        activate

  • On Linux:

        source venv/bin/activate

Install the requirements:

    pip install -r requirements.txt

Try it out by running one of the examples.

Accuracy

Training accuracy after 25 epochs = 93.8 % (loss = 0.148)

Validation accuracy after 25 epochs = 90.15 % (loss = 0.267)
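
To make the two numbers reported for each split concrete, the sketch below computes accuracy and loss for a handful of made-up predictions. It assumes the loss is binary cross-entropy and that a prediction counts as clickbait at a probability of 0.5 or higher; the README states neither, and the function names and sample values are purely illustrative.

```python
import math


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that fall on the correct side of the threshold."""
    hits = sum(1 for t, p in zip(y_true, y_prob) if (p >= threshold) == bool(t))
    return hits / float(len(y_true))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels (assumed loss function)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


# Made-up labels (1 = clickbait) and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.1]

print("accuracy = %.2f" % binary_accuracy(labels, probs))
print("loss = %.3f" % binary_crossentropy(labels, probs))
```

A gap between training and validation figures like the one reported above is what these two functions would show when evaluated on the two splits separately.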

Examples

    $ python src/detect.py "Novak Djokovic stunned as Australian Open title defence ends against Denis Istomin"
    Using TensorFlow backend.
    headline is 0.33 % clickbaity

    $ python src/detect.py "Just 22 Cute Animal Pictures You Need Right Now"
    Using TensorFlow backend.
    headline is 85.38 % clickbaity

    $ python src/detect.py "15 Beautifully Created Doors You Need To See Before You Die. The One In Soho Blew Me Away"
    Using TensorFlow backend.
    headline is 52.29 % clickbaity

    $ python src/detect.py "French presidential candidate Emmanuel Macrons anti-system angle is a sham | Philippe Marlire"
    Using TensorFlow backend.
    headline is 0.05 % clickbaity

Model Summary

    ____________________________________________________________________________________________________
    Layer (type)                     Output Shape          Param #     Connected to
    ====================================================================================================
    embedding_1 (Embedding)          (None, 20, 30)        195000      embedding_input_1[0][0]
    ____________________________________________________________________________________________________
    convolution1d_1 (Convolution1D)  (None, 19, 32)        1952        embedding_1[0][0]
    ____________________________________________________________________________________________________
    batchnormalization_1 (BatchNorma (None, 19, 32)        128         convolution1d_1[0][0]
    ____________________________________________________________________________________________________
    activation_1 (Activation)        (None, 19, 32)        0           batchnormalization_1[0][0]
    ____________________________________________________________________________________________________
    convolution1d_2 (Convolution1D)  (None, 18, 32)        2080        activation_1[0][0]
    ____________________________________________________________________________________________________
    batchnormalization_2 (BatchNorma (None, 18, 32)        128         convolution1d_2[0][0]
    ____________________________________________________________________________________________________
    activation_2 (Activation)        (None, 18, 32)        0           batchnormalization_2[0][0]
    ____________________________________________________________________________________________________
    convolution1d_3 (Convolution1D)  (None, 17, 32)        2080        activation_2[0][0]
    ____________________________________________________________________________________________________
    batchnormalization_3 (BatchNorma (None, 17, 32)        128         convolution1d_3[0][0]
    ____________________________________________________________________________________________________
    activation_3 (Activation)        (None, 17, 32)        0           batchnormalization_3[0][0]
    ____________________________________________________________________________________________________
    maxpooling1d_1 (MaxPooling1D)    (None, 1, 32)         0           activation_3[0][0]
    ____________________________________________________________________________________________________
    flatten_1 (Flatten)              (None, 32)            0           maxpooling1d_1[0][0]
    ____________________________________________________________________________________________________
    dense_1 (Dense)                  (None, 1)             33          flatten_1[0][0]
    ____________________________________________________________________________________________________
    batchnormalization_4 (BatchNorma (None, 1)             4           dense_1[0][0]
    ____________________________________________________________________________________________________
    activation_4 (Activation)        (None, 1)             0           batchnormalization_4[0][0]
    ====================================================================================================
    Total params: 201,533
    Trainable params: 201,339
    Non-trainable params: 194
    ____________________________________________________________________________________________________
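
The parameter counts in the summary can be reproduced by hand. The sketch below does that arithmetic in plain Python under assumptions inferred from the table rather than stated in it: a vocabulary of 6,500 tokens (195000 / 30), convolution kernels of width 2 (each convolution shortens the sequence by one step), and four parameters per batch-normalized channel, of which the moving mean and moving variance are not trained. The helper names are illustrative.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def conv1d_params(kernel_width, in_channels, filters):
    # Weights plus one bias per filter.
    return kernel_width * in_channels * filters + filters


def batchnorm_params(channels):
    # gamma, beta, moving mean, moving variance for each channel.
    return 4 * channels


def dense_params(inputs, units):
    # Weights plus one bias per unit.
    return inputs * units + units


VOCAB_SIZE = 6500   # assumption: inferred from 195000 / 30
EMBED_DIM = 30
KERNEL_WIDTH = 2    # assumption: inferred from output lengths 20 -> 19 -> 18 -> 17
FILTERS = 32

layers = [
    ("embedding_1", embedding_params(VOCAB_SIZE, EMBED_DIM)),
    ("convolution1d_1", conv1d_params(KERNEL_WIDTH, EMBED_DIM, FILTERS)),
    ("batchnormalization_1", batchnorm_params(FILTERS)),
    ("convolution1d_2", conv1d_params(KERNEL_WIDTH, FILTERS, FILTERS)),
    ("batchnormalization_2", batchnorm_params(FILTERS)),
    ("convolution1d_3", conv1d_params(KERNEL_WIDTH, FILTERS, FILTERS)),
    ("batchnormalization_3", batchnorm_params(FILTERS)),
    ("dense_1", dense_params(FILTERS, 1)),
    ("batchnormalization_4", batchnorm_params(1)),
]

total = sum(count for _, count in layers)
# Two untrained statistics per batch-normalized channel: 3 layers of 32, 1 layer of 1.
non_trainable = 2 * (3 * FILTERS + 1)
trainable = total - non_trainable

print("total=%d trainable=%d non_trainable=%d" % (total, trainable, non_trainable))
```

Activation, pooling, and flatten layers contribute no parameters, which is why they are left out of the list.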
    
    
    
    
    

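Data

The dataset contains 12,000 headlines, half of which are clickbait. The clickbait headlines were collected from BuzzFeed, NewsWeek, The Times of India, and The Huffington Post. The genuine (non-clickbait) headlines were collected from sources including The Hindu, The Guardian, The Economist, The Wall Street Journal, and National Geographic.

Some of this data was taken from the repository of another clickbait classifier.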

Pretrained Embeddings

I used Stanford's pretrained GloVe embeddings at 30 dimensions. This sped up training.
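
Pretrained GloVe vectors are distributed as plain text, one token per line followed by its vector components. A minimal sketch of turning such a file into an embedding matrix for a fixed vocabulary might look like the following. The function names, the tiny in-memory sample, and the zero-vector fallback for words without a pretrained vector are illustrative assumptions and are not taken from src/preprocess_embeddings.py.

```python
import io


def load_glove(handle):
    """Parse GloVe's text format: a token, then its components, space separated."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocabulary, vectors, dim):
    """One row per vocabulary word; words without a vector get zeros (assumption)."""
    return [vectors.get(word, [0.0] * dim) for word in vocabulary]


# Tiny stand-in for a real GloVe file, with 3 dimensions instead of 30.
sample = io.StringIO(u"cute 0.1 0.2 0.3\nanimal 0.4 0.5 0.6\n")
vectors = load_glove(sample)
matrix = build_embedding_matrix(["cute", "doors", "animal"], vectors, dim=3)

for word, row in zip(["cute", "doors", "animal"], matrix):
    print(word, row)
```

Starting the embedding layer from a matrix like this, rather than from random values, is one plausible reason training converges in fewer epochs.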

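Improving Accuracy

To improve accuracy:

  • Increase the embedding layer dimension (currently 30) - src/preprocess_embeddings.py
  • Use more data
  • Increase the vocabulary size - src/preprocess_text.py
  • Increase the maximum sequence length - src/train.py
  • Do better data cleaning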
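
Two of the settings listed above, vocabulary size and maximum sequence length, control how a headline becomes a fixed-length row of integers. The sketch below illustrates the idea in plain Python; it is not the code in src/preprocess_text.py or src/train.py. The helper names, the lowercase whitespace tokenizer, padding at the end of the row, and the use of index 0 for padding and 1 for out-of-vocabulary words are all assumptions made for the example.

```python
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved indices


def build_vocabulary(headlines, vocabulary_size):
    """Map the most frequent words to indices 2..vocabulary_size-1."""
    counts = Counter(word for h in headlines for word in h.lower().split())
    most_common = [word for word, _ in counts.most_common(vocabulary_size - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}


def encode(headline, vocabulary, max_length):
    """Convert a headline to exactly max_length indices (truncate, then pad)."""
    ids = [vocabulary.get(word, UNK) for word in headline.lower().split()]
    ids = ids[:max_length]
    return ids + [PAD] * (max_length - len(ids))


headlines = [
    "you need to see these doors",
    "you need these animal pictures",
]
vocab = build_vocabulary(headlines, vocabulary_size=5)
row = encode("you need new doors", vocab, max_length=6)
print(row)
```

With a vocabulary this small, "doors" falls outside the kept words and is encoded the same way as the unseen word "new". Raising the vocabulary size lets more words keep their own index, and raising the maximum length keeps more of a long headline, at the cost of a larger embedding table and longer inputs.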
