An Inteum RSS patents harvester for Pure.
A set of Python tools to help transform data for Pure.
Disclaimer: This code is provided “as-is” for educational purposes under the MIT license and should in no way be considered production-ready code. The author is not obligated to maintain, fix or otherwise provide technical support to end-users. Please refer to the license for more details.
Tools:
Run pip install -r requirements.txt to add packages.
The output file <YOUR_PURE_NAME>_out.xml can be imported into Pure.
Each site is a Site object that contains a Pure API key, name, Pure URL, root org ID and source URL (to pull XML data from).
The script supports harvesting multiple sites. To add one, append another entry to the sites.json file; each entry is instantiated as a Site object.
[
  {
    "py/object": "puresite.Site",
    "api_key": "<YOUR_API_KEY>",
    "name": "<YOUR_PURE_NAME>",
    "pure_url": "<YOUR_PURE_URL>",
    "root_org": "<YOUR_ROOT_ORG_ID>",
    "url": "<YOUR_DATA_PURL>"
  }
]
Example:
[
  {
    "py/object": "puresite.Site",
    "api_key": "xxxxxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "name": "my_name",
    "pure_url": "https://mypure.elsevierpure.com",
    "root_org": "my_root_org_source_id",
    "url": "http://my_inteum_site.technologypublisher.com/RssDataFeed.aspx?UpdateOnOrAfter=1/1/2010"
  }
]
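As a rough illustration of how entries in sites.json might become Site objects, here is a minimal sketch that uses only the standard library. The py/object key suggests the project itself relies on jsonpickle for this step; the Site dataclass and the load_sites helper below are assumptions made for illustration, not the contents of puresite.py.

```python
import json
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical stand-in for puresite.Site, holding one sites.json entry."""
    api_key: str
    name: str
    pure_url: str
    root_org: str
    url: str


def load_sites(text):
    """Parse the JSON array from sites.json into a list of Site objects."""
    sites = []
    for entry in json.loads(text):
        # Drop the "py/object" type tag; the remaining keys map onto Site fields.
        fields = {k: v for k, v in entry.items() if k != "py/object"}
        sites.append(Site(**fields))
    return sites


SAMPLE = """
[
  {
    "py/object": "puresite.Site",
    "api_key": "xxxxxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "name": "my_name",
    "pure_url": "https://mypure.elsevierpure.com",
    "root_org": "my_root_org_source_id",
    "url": "http://my_inteum_site.technologypublisher.com/RssDataFeed.aspx?UpdateOnOrAfter=1/1/2010"
  }
]
"""

for site in load_sites(SAMPLE):
    # The output file name follows the <YOUR_PURE_NAME>_out.xml pattern.
    print(site.name, "->", site.name + "_out.xml")
```

Adding a second object to the array would yield a second Site, which is how multiple sites are harvested in one run.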
See rss-to-pubs.xsl for the stylesheet that converts the feed XML into XML that Pure can ingest. Note that parameters are passed in from Python, so the transformation will not be complete when run outside of patents.py.
The RSS feed does not contain unique identifiers for inventors. Pure will only match inventors to internal persons if there is an exact match on the name, which cannot always be guaranteed. To get around this, the script will attempt to map inventors to persons in Pure.
See puresite.py for details on how internal Pure persons are matched using a Lucene query against the Pure API. Each successful match maps the inventor to the externalId (Pure source ID) of a person with a closely matching name.
Note: If multiple persons match on the same name, only the last match will be saved. Additional functionality can be added for more sophisticated matching logic.
Each match is added to a dictionary and saved to a JSON file:
{"John;Doe": "7213cfd7abac1973c9d018a3fb1022f3"}
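The "last match wins" behaviour follows naturally from using a dictionary keyed by name. The sketch below shows the idea under stated assumptions: the key is taken to be first name and last name joined by a semicolon, as in the example above, and the helper names and candidate data are hypothetical rather than taken from puresite.py.

```python
import json


def person_key(first_name, last_name):
    """Build a key in the "First;Last" form shown in the example above."""
    return f"{first_name};{last_name}"


def record_matches(candidates):
    """Collect (first, last, external_id) tuples into a name -> ID dictionary.

    Assigning to the same key again overwrites the earlier value, so when
    several persons share a name only the last one is kept.
    """
    matches = {}
    for first_name, last_name, external_id in candidates:
        matches[person_key(first_name, last_name)] = external_id
    return matches


# Hypothetical candidates: two different persons named John Doe.
candidates = [
    ("John", "Doe", "aaaa1111"),
    ("Jane", "Roe", "bbbb2222"),
    ("John", "Doe", "7213cfd7abac1973c9d018a3fb1022f3"),
]

matches = record_matches(candidates)
# This is the text that would be written to the JSON file.
serialized = json.dumps(matches, indent=2)
print(serialized)
```

A more careful matcher could store a list of IDs per name and decide between them later, at the cost of a slightly more complex lookup.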
Matches are used by the XSLT to enrich persons with an internal ID using an extension function, python:lookup_person.
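The lookup itself can be pictured as a plain dictionary access. The sketch below is an assumption about its shape, not the code in patents.py: it takes a first and last name, builds the "First;Last" key used in the matches file, and returns an empty string when no match exists so that the stylesheet can leave the person unmatched. When registered as an lxml extension function, a function like this would additionally receive an XPath context as its first argument.

```python
# Hypothetical in-memory copy of the saved matches.
MATCHES = {"John;Doe": "7213cfd7abac1973c9d018a3fb1022f3"}


def lookup_person(first_name, last_name, matches=MATCHES):
    """Return the matched Pure source ID for a name, or "" if none was found."""
    # Strip stray whitespace so " John " and "John" resolve to the same key.
    key = f"{first_name.strip()};{last_name.strip()}"
    return matches.get(key, "")


print(lookup_person("John", "Doe"))
print(repr(lookup_person("Jane", "Roe")))
```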
Any matches will be saved to a file called