Project author: RobertMyles

Project description:
An R package for extracting 'tidy' data frames from RSS, Atom, JSON and geoRSS feeds
Primary language: R
Project URL: git://github.com/RobertMyles/tidyRSS.git
Created: 2017-02-23T15:31:43Z
Project community: https://github.com/RobertMyles/tidyRSS

License: Other

tidyRSS

tidyRSS is a package for extracting data from RSS feeds, including Atom
feeds and JSON feeds. For geo-type feeds, see the section on changes in
version 2 below, or jump directly to tidygeoRSS, which is designed for
that purpose.

It is easy to use as it only has one function, tidyfeed(), which takes
five arguments:

  • the url of the feed;
  • a logical flag for whether you want the feed returned as a tibble or
    a list containing two tibbles;
  • a logical flag for whether you want HTML tags removed from columns
    in the dataframe;
  • a config list that is passed off to
    httr::GET();
  • and a parse_dates argument, a logical flag, which will attempt to
    parse dates if TRUE (see below).
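
As a rough illustration of the five arguments, the sketch below wraps
tidyfeed() in a small helper that spells each one out. The helper name
fetch_feed and its parameter names are made up for this example. The
tidyfeed() argument names other than parse_dates (feed, config,
clean_tags, list) are assumed from the package documentation rather than
stated in this README, so check ?tidyfeed for your installed version.

```r
# Hypothetical wrapper: fetch_feed() is not part of tidyRSS.
# Assumption: tidyfeed() accepts feed, config, clean_tags, list and
# parse_dates as named arguments; verify with ?tidyRSS::tidyfeed.
fetch_feed <- function(url, as_list = FALSE, strip_html = TRUE, parse = TRUE) {
  if (!requireNamespace("tidyRSS", quietly = TRUE)) {
    stop("Please install tidyRSS first: install.packages(\"tidyRSS\")")
  }
  tidyRSS::tidyfeed(
    feed = url,               # the url of the feed
    config = list(),          # handed on to httr::GET(); empty means defaults
    clean_tags = strip_html,  # TRUE removes HTML tags from columns
    list = as_list,           # TRUE returns a list of two tibbles, not one tibble
    parse_dates = parse       # TRUE attempts to parse dates
  )
}

# Example call (needs network access), left commented out:
# rj <- fetch_feed("http://journal.r-project.org/rss.atom", parse = FALSE)
```

Naming every argument keeps the call readable and guards against
changes in argument order between versions.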

If parse_dates is TRUE, tidyfeed() will attempt to parse dates using the
anytime package. Note that this removes some lower-level control that
you may wish to retain over how dates are parsed. See this issue for an
example.
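
If you prefer to keep that control, one option is to set parse_dates to
FALSE and parse the date columns yourself. The sketch below uses only
base R and a hand-built data frame standing in for a fetched Atom feed.
The column name entry_published and the helper parse_atom_date are
assumptions for illustration, so inspect names() on your own result
first.

```r
# Stand-in for a feed fetched with parse_dates = FALSE.
# The column names here are hypothetical; real feeds may differ.
feed <- data.frame(
  entry_title = c("First post", "Second post"),
  entry_published = c("2017-02-23T15:31:43Z", "2017-03-01T08:00:00Z"),
  stringsAsFactors = FALSE
)

# Parse UTC timestamps of the form YYYY-MM-DDTHH:MM:SSZ.
# Strings that do not match this exact format become NA.
parse_atom_date <- function(x) {
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
}

feed$entry_published <- parse_atom_date(feed$entry_published)
```

This handles only whole-second UTC timestamps ending in Z. Feeds that
use numeric offsets, fractional seconds or RSS-style dates would need a
different format string.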

Installation

It can be installed directly from CRAN with:

    install.packages("tidyRSS")

The development version can be installed from GitHub with the
remotes package:

    remotes::install_github("robertmyles/tidyrss")

Usage

Here is how you can get the contents of the R Journal:

    library(tidyRSS)
    tidyfeed("http://journal.r-project.org/rss.atom")

Changes in version 2.0.0

The biggest change in version 2 is that tidyRSS no longer attempts to
parse geo-type feeds into sf tibbles. This functionality has been moved
to tidygeoRSS.

Issues

XML feeds can be finicky things. If you find one that doesn’t work with
tidyfeed(), feel free to create an issue with the URL of the feed that
you are trying. Pull requests are welcome if you’d like to try to fix it
yourself. For older RSS feeds, some fields will almost never be ‘clean’;
that is, they will contain things like newlines (\n) or extra quote
marks. Cleaning these in a generic way is more or less impossible, so I
suggest you use stringr, strex and/or tools from base R such as gsub to
clean them. This will mainly affect the item_description column of a
parsed RSS feed, and will not often affect Atom feeds (and should never
be a problem with JSON).
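
As a minimal sketch of the base R route, the helper below strips
newlines and stray double quotes from a character vector such as
item_description. The function name clean_description and the sample
string are invented for this example, and the rules are deliberately
simple, so adapt them to the feed you are working with.

```r
# Hypothetical helper: tidy up messy description text with base R only.
clean_description <- function(x) {
  x <- gsub("[\r\n\t]+", " ", x)        # newlines and tabs become one space
  x <- gsub("\"", "", x, fixed = TRUE)  # drop straight double quotes
  x <- gsub(" {2,}", " ", x)            # collapse runs of spaces
  trimws(x)                             # remove leading/trailing whitespace
}

messy <- "  A \"quoted\"\n\nheadline \n"
clean_description(messy)
```

On a parsed feed this could be applied as
feed$item_description <- clean_description(feed$item_description).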

There are two other related packages that I’m aware of: feedeR and rss.

In comparison to feedeR, tidyRSS returns more information from the RSS
feed (if it exists), and development on rss seems to have stopped some
time ago.

Other

For the schemas used to develop the parsers in this package, see:

I’ve implemented most of the items in the schemas above. The following
are not yet implemented:

Atom meta info:

  • contributor, generator, logo, subtitle

RSS meta info:

  • cloud
  • image
  • textInput
  • skipHours
  • skipDays