Project author: LapisDev

Project description:
A collection of portable class libraries, written in C#, for collecting web data.
Language: C#
Repository: git://github.com/LapisDev/WebCrawling.git
Created: 2017-08-02T14:51:10Z
Project community: https://github.com/LapisDev/WebCrawling

License: MIT License



WebCrawling

A collection of portable class libraries, written in C#, for collecting web data.
It has since been ported to .NET Core.

Lapis.WebCrawling

Lapis.WebCrawling provides basic infrastructure for
downloading web pages.

Lapis.WebCrawling.HtmlParsing

Lapis.WebCrawling.HtmlParsing mainly contains:

  • an HTML parser and DOM representation;
  • CSS selectors.

The example below uses a CSS selector to find a node in the HTML DOM. Given the following HTML document:

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
  <html>
  <head>
  <title>Test Document</title>
  <link rel="copyright copyleft" hreflang="en-us">
  </head>
  <body>
  <h1>This is just a test document</h1>
  <p>It will be used in tests.</p>
  <p id="info">It probably will not be used anywhere else.</p>
  <h2>Really, it's unfortunate.</h2>
  <q lang="en-us">Here's a quotation.</q>
  <p id="google">You might want to check out <a href="http://www.google.com">Google</a></p>
  <form action="">
  <input type="text" disabled="disabled">
  <input type="text">
  <input type="checkbox" checked="checked">
  </form>
  <p class="more">Nothing to really talk about.</p>
  </body>
  </html>

the following code selects the last child of the body element and prints its name, its class attribute, and its first child:

  // `html` is assumed to be a string holding the document above.
  HtmlDocument doc = Html.Parse(html);
  // "body > :last-child" matches the last direct child of <body>.
  var tag = doc.Find("body > :last-child") as HtmlElement;
  Console.WriteLine(tag.Name); // p
  Console.WriteLine(tag.Attribute("class").Value); // more
  Console.WriteLine(tag.Children[0]); // Nothing to really talk about.

Lapis.WebCrawling.Processing

Lapis.WebCrawling.Processing provides basic
infrastructure for extracting data from HTML.