Algolia DocSearch
February 20, 2023
I. Register an Account
1.1 Open the official website
1.2 Go to the dashboard
II. Data Source
2.1 Create a data source
1. Click Data Sources.
2. Click the up/down arrows under Application.
3. Click Create Application.
4. Enter the data source details: name and type.
5. Choose the server region; pick the one closest to you.
6. Check all the boxes, then click Create Application.
III. Index
3.1 Create an index
1. Enter an index name and click the create button. The index is created immediately.
IV. Set Up the Crawling Environment
4.1 Install dependencies
- Install jq:
brew install jq
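- Install Docker:
brew install docker
4.2 Start Docker
Click the Docker icon to start it. Note that the clickable icon belongs to the Docker Desktop app; the docker Homebrew formula by itself provides only the command-line client.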
4.3 Look up the credentials
- Go to the home page and click API Keys.
- The APP ID and API Key are listed there.
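4.4 Configure the files
- Create a .env file in the project root to hold the environment variables. Keep notes on their own # lines, because anything after the = sign, including a trailing // note, is read as part of the value:
# Application ID
ALGOLIA_APP_ID=xxx
# Admin API Key
ALGOLIA_API_KEY=xxx
- Create a docsearch.json file in the project root to configure DocSearch: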
Replace the placeholder values before use. The notes are kept outside the file because JSON has no comment syntax and jq rejects // comments:
- index_name: the name of the index created in section III.
- start_urls: the site's domain(s), for example ["http://test.bolawen.com/", "https://bolawen.com/"].
- sitemap_urls: the sitemap URL(s), for example ["http://test.bolawen.com/sitemap.xml", "https://bolawen.com/sitemap.xml"].
- stop_urls: route paths that should not be crawled.
{
  "index_name": "your-index-name",
  "start_urls": ["https://www.website.com/"],
  "sitemap_urls": ["https://www.website.com/sitemap.xml"],
  "stop_urls": ["/search"],
  "selectors": {
    "lvl0": {
      "selector": "(//ul[contains(@class,'menu__list')]//a[contains(@class, 'menu__link menu__link--sublist menu__link--active')]/text() | //nav[contains(@class, 'navbar')]//a[contains(@class, 'navbar__link--active')]/text())[last()]",
      "type": "xpath",
      "global": true,
      "default_value": "Documentation"
    },
    "lvl1": "header h1, article h1",
    "lvl2": "article h2",
    "lvl3": "article h3",
    "lvl4": "article h4",
    "lvl5": "article h5, article td:first-child",
    "lvl6": "article h6",
    "text": "article p, article li, article td:last-child"
  },
  "custom_settings": {
    "attributesForFaceting": [
      "type",
      "lang",
      "language",
      "version",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ],
    "attributesToHighlight": ["hierarchy", "content"],
    "attributesToSnippet": ["content:10"],
    "camelCaseAttributes": ["hierarchy", "content"],
    "searchableAttributes": [
      "unordered(hierarchy.lvl0)",
      "unordered(hierarchy.lvl1)",
      "unordered(hierarchy.lvl2)",
      "unordered(hierarchy.lvl3)",
      "unordered(hierarchy.lvl4)",
      "unordered(hierarchy.lvl5)",
      "unordered(hierarchy.lvl6)",
      "content"
    ],
    "distinct": true,
    "attributeForDistinct": "url",
    "customRanking": [
      "desc(weight.pageRank)",
      "desc(weight.level)",
      "asc(weight.position)"
    ],
    "ranking": [
      "words",
      "filters",
      "typo",
      "attribute",
      "proximity",
      "exact",
      "custom"
    ],
    "highlightPreTag": "<span class='algolia-docsearch-suggestion--highlight'>",
    "highlightPostTag": "</span>",
    "minWordSizefor1Typo": 3,
    "minWordSizefor2Typos": 7,
    "allowTyposOnNumericTokens": false,
    "minProximity": 1,
    "ignorePlurals": true,
    "advancedSyntax": true,
    "attributeCriteriaComputedByMinProximity": true,
    "removeWordsIfNoResults": "allOptional",
    "separatorsToIndex": "_",
    "synonyms": [
      ["js", "javascript"],
      ["ts", "typescript"]
    ]
  }
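}
- In docusaurus.config.js in the project root, add the algolia block (in Docusaurus it belongs under themeConfig):
algolia: {
  appId: "K0K81***KFR", // Application ID
  apiKey: "67eaa53e310****5cc869a11dcb1ff9", // Search-Only API Key; this config ships to the browser, so do not put the Admin API Key here
  indexName: "blog-docusa****ebpack-algolia-index", // index name
},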
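As a sanity check on the .env file from this section, the sketch below uses a simplified model of how an env file is read: one VAR=VALUE pair per line, lines starting with # skipped, and everything after the first = taken as the value. It shows why an inline // note ends up inside the value. The names `parseEnvFile` and `missingKeys` are hypothetical helpers written for this post, not part of Docker or DocSearch.

```javascript
// Hypothetical helper: simplified model of env-file parsing.
// One VAR=VALUE per line; lines starting with # are comments;
// the value is everything after the first "=" (no inline comments).
function parseEnvFile(text) {
  const vars = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trimStart();
    if (line === "" || line.startsWith("#")) continue; // blank line or comment
    const eq = line.indexOf("=");
    if (eq === -1) continue; // bare names are ignored in this sketch
    vars[line.slice(0, eq)] = line.slice(eq + 1);
  }
  return vars;
}

// Hypothetical helper: list the required keys that are absent or empty.
function missingKeys(vars, required = ["ALGOLIA_APP_ID", "ALGOLIA_API_KEY"]) {
  return required.filter((key) => !vars[key]);
}

const good = parseEnvFile(
  "# Application ID\nALGOLIA_APP_ID=xxx\n# Admin API Key\nALGOLIA_API_KEY=yyy\n"
);
console.log(missingKeys(good)); // []

// An inline "//" note is not stripped; it becomes part of the value.
const bad = parseEnvFile("ALGOLIA_APP_ID=xxx // Application ID\n");
console.log(JSON.stringify(bad.ALGOLIA_APP_ID)); // "xxx // Application ID"
console.log(missingKeys(bad)); // [ 'ALGOLIA_API_KEY' ]
```

To check a real file, the text could be read with Node's `fs.readFileSync(".env", "utf8")` and passed to `parseEnvFile`.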
V. Run Docker and Crawl the Data
In a terminal, run the Docker command that crawls the site and pushes the results:
docker run -it --env-file=.env -e "CONFIG=$(cat docsearch.json | jq -r tostring)" algolia/docsearch-scraper
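The `$(cat docsearch.json | jq -r tostring)` part of the command parses the file and re-serializes it as a single-line JSON string for the CONFIG variable. A rough JavaScript equivalent, handy for checking the file before starting a container, is sketched below. `compactConfig` is a hypothetical helper, and the list of required fields is an assumption based on the fields used in this post, not the scraper's full schema.

```javascript
// Hypothetical helper: validate docsearch.json text and return the one-line
// JSON string that would be passed as CONFIG.
function compactConfig(text) {
  let config;
  try {
    // Strict JSON: `//` comments and trailing commas are rejected.
    config = JSON.parse(text);
  } catch (err) {
    throw new Error(`docsearch.json is not valid JSON: ${err.message}`);
  }
  if (config === null || typeof config !== "object" || Array.isArray(config)) {
    throw new Error("docsearch.json must contain a JSON object");
  }
  // Assumed minimum for this walkthrough, not an official schema.
  for (const field of ["index_name", "start_urls", "selectors"]) {
    if (!(field in config)) {
      throw new Error(`docsearch.json is missing "${field}"`);
    }
  }
  if (!Array.isArray(config.start_urls) || config.start_urls.length === 0) {
    throw new Error('"start_urls" must be a non-empty array');
  }
  return JSON.stringify(config); // compact, single line
}

const sample = `{
  "index_name": "my-index",
  "start_urls": ["https://www.website.com/"],
  "selectors": { "lvl1": "article h1", "text": "article p" }
}`;
console.log(compactConfig(sample));

// The same content with an inline comment fails to parse.
try {
  compactConfig(sample.replace('"my-index",', '"my-index", // index name'));
} catch (err) {
  console.log(err.message);
}
```

Running a check like this first turns a confusing failure inside the `docker run` command into an explicit message about which file is at fault.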
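Then wait: the first run takes a while because Docker has to download the scraper image. Once the scraper's crawl output appears in the console, the locally crawled content is being pushed to Algolia. You can then open Algolia and inspect the crawled data.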
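Besides looking at the dashboard, the index can be spot-checked with a search request. The sketch below only builds the request and does not send it. It assumes Algolia's REST search endpoint (`POST /1/indexes/{indexName}/query` on `{appId}-dsn.algolia.net`) with the `X-Algolia-Application-Id` and `X-Algolia-API-Key` headers. `buildSearchRequest` is a hypothetical helper, and the credentials shown are placeholders to replace with your own, preferably the Search-Only key.

```javascript
// Hypothetical helper: build (but do not send) a search request for one index.
function buildSearchRequest({ appId, apiKey, indexName, query, hitsPerPage = 3 }) {
  // Search parameters are sent URL-encoded inside the "params" field.
  const params = new URLSearchParams({ query, hitsPerPage: String(hitsPerPage) });
  return {
    url: `https://${appId}-dsn.algolia.net/1/indexes/${encodeURIComponent(indexName)}/query`,
    options: {
      method: "POST",
      headers: {
        "X-Algolia-Application-Id": appId,
        "X-Algolia-API-Key": apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ params: params.toString() }),
    },
  };
}

const request = buildSearchRequest({
  appId: "YOUR_APP_ID", // placeholder
  apiKey: "YOUR_SEARCH_KEY", // placeholder
  indexName: "your-index-name", // placeholder
  query: "docker",
});
console.log(request.url);
console.log(request.options.body);
```

With Node 18 or later, `fetch(request.url, request.options).then((res) => res.json())` would send it; a non-empty `hits` array in the response suggests the crawl populated the index.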