Algolia DocSearch

February 20, 2023
柏拉文
The harder you work, the luckier you get

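I. Create an Account


1.1 Open the official website

Open https://www.algolia.com/ and sign up for an account.

1.2 Go to the dashboard

After signing in, open the Algolia dashboard.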

II. Data Source

2.1 Create a data source

1. Click Data Sources

2. Click the up/down arrows under Application

3. Click Create Application

4. Enter the data source details: name and type

5. Choose the server region: pick the one closest to you

6. Check all the boxes, then click Create Application

III. Index


3.1 Create an index

1. Enter the index name and click the Create button to create the index

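IV. Set Up the Crawling Environment


4.1 Install dependencies

  • Install jq

    brew install jq
  • Install Docker (this formula installs the Docker command-line client; the Docker Desktop app started in 4.2 is distributed as a Homebrew cask)

    brew install docker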
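
Before moving on, it can help to confirm that both tools are actually on the PATH. The following is a minimal Python sketch and not part of the original setup; the helper name `missing_tools` is made up for illustration.

```python
import shutil


def missing_tools(names):
    """Return the tools from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(["jq", "docker"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("jq and docker are both installed")
```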

4.2 Start Docker

Click the Docker icon to start Docker
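
One way to check that the daemon is really up before crawling is to run `docker info`, which exits with a non-zero status when the daemon cannot be reached. The sketch below wraps that check; `docker_daemon_running` is a hypothetical helper name, and the 20-second timeout is an arbitrary assumption.

```python
import subprocess


def docker_daemon_running(docker_cmd="docker", timeout=20):
    """Return True if `docker info` exits successfully, i.e. the daemon answers."""
    try:
        result = subprocess.run(
            [docker_cmd, "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The CLI is missing, or the daemon did not answer in time.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("Docker is running" if docker_daemon_running() else "Docker is not running yet")
```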

4.3 Look up the credentials

  1. Go to the dashboard home page and click API Keys
  2. The page shows the Application ID and the API keys

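4.4 Create the configuration files

  1. Create a .env file in the project root to hold the environment variables. Set ALGOLIA_APP_ID to the Application ID and ALGOLIA_API_KEY to the Admin API Key. Docker reads each line literally, so do not put comments after the values. If the scraper reports missing credentials, check the variable names: the legacy DocSearch scraper documentation names them APPLICATION_ID and API_KEY.

    ALGOLIA_APP_ID=xxx
    ALGOLIA_API_KEY=xxx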
  2. Create a docsearch.json file in the project root to configure DocSearch. Set index_name to the name of the index created in section III. start_urls and sitemap_urls take your site's address(es), for example ["http://test.bolawen.com/", "https://bolawen.com/"] and ["http://test.bolawen.com/sitemap.xml", "https://bolawen.com/sitemap.xml"]. stop_urls lists the route paths that should not be crawled. JSON does not allow comments, so keep the file free of them, otherwise jq cannot parse it.

    {
      "index_name": "your-index-name",
      "start_urls": ["https://www.website.com/"],
      "sitemap_urls": ["https://www.website.com/sitemap.xml"],
      "stop_urls": ["/search"],
      "selectors": {
        "lvl0": {
          "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
          "type": "xpath",
          "global": true,
          "default_value": "Documentation"
        },
        "lvl1": "header h1, article h1",
        "lvl2": "article h2",
        "lvl3": "article h3",
        "lvl4": "article h4",
        "lvl5": "article h5, article td:first-child",
        "lvl6": "article h6",
        "text": "article p, article li, article td:last-child"
      },
      "custom_settings": {
        "attributesForFaceting": [
          "type",
          "lang",
          "language",
          "version",
          "docusaurus_tag"
        ],
        "attributesToRetrieve": [
          "hierarchy",
          "content",
          "anchor",
          "url",
          "url_without_anchor",
          "type"
        ],
        "attributesToHighlight": ["hierarchy", "content"],
        "attributesToSnippet": ["content:10"],
        "camelCaseAttributes": ["hierarchy", "content"],
        "searchableAttributes": [
          "unordered(hierarchy.lvl0)",
          "unordered(hierarchy.lvl1)",
          "unordered(hierarchy.lvl2)",
          "unordered(hierarchy.lvl3)",
          "unordered(hierarchy.lvl4)",
          "unordered(hierarchy.lvl5)",
          "unordered(hierarchy.lvl6)",
          "content"
        ],
        "distinct": true,
        "attributeForDistinct": "url",
        "customRanking": [
          "desc(weight.pageRank)",
          "desc(weight.level)",
          "asc(weight.position)"
        ],
        "ranking": [
          "words",
          "filters",
          "typo",
          "attribute",
          "proximity",
          "exact",
          "custom"
        ],
        "highlightPreTag": "<span class='algolia-docsearch-suggestion--highlight'>",
        "highlightPostTag": "</span>",
        "minWordSizefor1Typo": 3,
        "minWordSizefor2Typos": 7,
        "allowTyposOnNumericTokens": false,
        "minProximity": 1,
        "ignorePlurals": true,
        "advancedSyntax": true,
        "attributeCriteriaComputedByMinProximity": true,
        "removeWordsIfNoResults": "allOptional",
        "separatorsToIndex": "_",
        "synonyms": [
          ["js", "javascript"],
          ["ts", "typescript"]
        ]
      }
    }
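  3. Configure the algolia field under themeConfig in docusaurus.config.js in the project root. This configuration is shipped to the browser, so use the Search-Only API Key here, not the Admin API Key.

    algolia: {
      appId: "K0K81***KFR", // Application ID
      apiKey: "67eaa53e310****5cc869a11dcb1ff9", // Search-Only API Key
      indexName: "blog-docusa****ebpack-algolia-index", // index name
    },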
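
Because `docker run --env-file` takes everything after `=` as the value and `jq` accepts only strict JSON, a stray comment in either file is an easy mistake to make. The following is a rough Python sketch of a pre-flight check, assuming the variable names used in this article's `.env` and treating `index_name` and `start_urls` as the required keys; the function names `check_env_text` and `check_config_text` are hypothetical.

```python
import json


def check_env_text(text, required=("ALGOLIA_APP_ID", "ALGOLIA_API_KEY")):
    """Return a list of problems found in the contents of a .env file."""
    problems = []
    seen = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and full-line comments are fine
        key, sep, value = line.partition("=")
        if not sep:
            problems.append(f"not a KEY=value line: {line!r}")
            continue
        seen[key.strip()] = value
        # Everything after "=" is kept as the value, so a trailing
        # comment would silently become part of the credential.
        if "//" in value or " #" in value:
            problems.append(f"{key.strip()}: value seems to contain a trailing comment")
    for key in required:
        if not seen.get(key, "").strip():
            problems.append(f"{key}: missing or empty")
    return problems


def check_config_text(text, required=("index_name", "start_urls")):
    """Return a list of problems found in the contents of docsearch.json."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON (comments are not allowed): {exc}"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    return [f"{key}: missing or empty" for key in required if not config.get(key)]


if __name__ == "__main__":
    for path, check in ((".env", check_env_text), ("docsearch.json", check_config_text)):
        try:
            with open(path, encoding="utf-8") as handle:
                problems = check(handle.read())
        except FileNotFoundError:
            problems = ["file not found"]
        print(path, "OK" if not problems else problems)
```

Run from the project root, the script prints one line per file; an empty problem list means the file passed these basic checks, not that the credentials themselves are valid.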

V. Run Docker and Crawl the Data


Run the following docker command in the terminal to crawl the site and push the records

docker run -it --env-file=.env -e "CONFIG=$(cat docsearch.json | jq -r tostring)" algolia/docsearch-scraper
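
To make the `CONFIG=$(cat docsearch.json | jq -r tostring)` part less opaque: it collapses the JSON file into a single-line string and hands it to the container as an environment variable. The Python sketch below builds a roughly equivalent command; `build_scraper_command` is an illustrative name, and the output is not guaranteed to be byte-identical to what jq produces (number formatting, for example, may differ).

```python
import json
import shlex


def build_scraper_command(config_text, env_file=".env", image="algolia/docsearch-scraper"):
    """Build the argv for the docker run command shown above."""
    # Parse, then re-serialise on one line without extra whitespace.
    compact = json.dumps(json.loads(config_text), separators=(",", ":"), ensure_ascii=False)
    return ["docker", "run", "-it", f"--env-file={env_file}", "-e", f"CONFIG={compact}", image]


if __name__ == "__main__":
    sample = '{\n  "index_name": "docs",\n  "start_urls": ["https://www.website.com/"]\n}'
    # shlex.join quotes the arguments so the line can be pasted into a shell.
    print(shlex.join(build_scraper_command(sample)))
```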

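Then comes the waiting: the first run takes a while because Docker has to download the scraper image. Once the console reports the results of the crawl,

the locally crawled content is being pushed to Algolia. You can then open the index in the Algolia dashboard to inspect the crawled records.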
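
Besides looking at the dashboard, the index can also be queried directly through Algolia's REST search endpoint. The following is a minimal sketch using only the Python standard library; the function names, the default of 3 hits, and the 10-second timeout are assumptions for illustration, and a Search-Only API Key is sufficient for this call.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(app_id, api_key, index_name, query, hits_per_page=3):
    """Build (but do not send) a search request against Algolia's REST API."""
    url = "https://{}-dsn.algolia.net/1/indexes/{}/query".format(
        app_id, urllib.parse.quote(index_name, safe="")
    )
    body = json.dumps({"query": query, "hitsPerPage": hits_per_page}).encode("utf-8")
    headers = {
        "X-Algolia-Application-Id": app_id,
        "X-Algolia-API-Key": api_key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def search(app_id, api_key, index_name, query):
    """Send the request and return the list of hits from the JSON response."""
    request = build_search_request(app_id, api_key, index_name, query)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["hits"]
```

Calling `search(...)` with your own Application ID, key, and index name should return a non-empty list once the crawl has pushed records; with the configuration above, each hit carries fields such as `url` and `hierarchy`.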