annotate backend/AbstractFeedUpdater.py @ 257:75b81da8d7a5

convert the feed entry timestamps to arango compatible date strings in migration
author Dirk Olmes <dirk@xanthippe.ping.de>
date Tue, 12 Mar 2019 02:38:41 +0100
parents 8e73a8ae863f
children
rev   line source
217
bb3c851b18b1 add source file endcoding header
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 213
diff changeset
1 # -*- coding: utf-8 -*-
206
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
2 import AbstractBackend
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
3 import feedparser
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
4 import logging
218
699d8f1cebd4 unify imports, especially Qt imports. Use consistent super syntax
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 217
diff changeset
5 from datetime import datetime
699d8f1cebd4 unify imports, especially Qt imports. Use consistent super syntax
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 217
diff changeset
6 from urllib2 import ProxyHandler
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
7
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
8 STATUS_ERROR = 400
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
9 log = logging.getLogger("FeedUpdater")
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
10
245
8e73a8ae863f Fix the docstrings
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 244
diff changeset
11 """
8e73a8ae863f Fix the docstrings
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 244
diff changeset
12 Abstract base class for FeedUpdater implementations - handles all the parsing of the feed.
8e73a8ae863f Fix the docstrings
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 244
diff changeset
13 Subclasses need to implement creating and storing the new feed entries.
8e73a8ae863f Fix the docstrings
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 244
diff changeset
14 """
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
15 class AbstractFeedUpdater(object):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
16
166
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
17 def __init__(self, preferences):
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
18 self.preferences = preferences
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
19
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
20 def update(self, feed):
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
21 self.feed = feed
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
22 log.info("updating " + feed.rss_url)
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
23 result = self._retrieveFeed()
167
a3c945ce434c adjust the sqlalchemy backend to the changes in AbstractFeedUpdater
dirk
parents: 166
diff changeset
24 self._setFeedTitle(result)
160
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
25 self._processEntries(result)
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
26
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
27 def _retrieveFeed(self):
244
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
28 # when updating to python3 see http://code.google.com/p/feedparser/issues/detail?id=260
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
29 handlers = None
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
30 if self.preferences.isProxyConfigured() and self.preferences.useProxy():
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
31 proxyUrl = '{0}:{1}'.format(self.preferences.proxyHost(), self.preferences.proxyPort())
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
32 proxyHandler = ProxyHandler({'http': proxyUrl, 'https': proxyUrl})
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
33 handlers = [proxyHandler]
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
34
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
35 result = feedparser.parse(self.feed.rss_url, handlers)
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
36 if result.bozo > 0:
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
37 log.warning('result contains bozo')
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
38 log.warning(result)
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
39 # bozo is non-zero when the feed is not well-formed.
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
40 # if result["bozo"] > 0:
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
41 # raise FeedUpdateException()
244
b46d7fe6390b re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 243
diff changeset
42 status = result.status
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
43 if status >= STATUS_ERROR:
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
44 raise FeedUpdateException("HTTP status " + str(status))
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
45 return result
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
46
160
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
47 def _processEntries(self, feedDict):
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
48 for entry in feedDict.entries:
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
49 self._normalize(entry)
206
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
50 if not self._isExpired(entry):
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
51 self._processEntry(entry)
160
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
52 self._incrementFeedUpdateDate()
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
53
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
54 def _normalize(self, entry):
197
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
55 self._normalizeId(entry)
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
56 self._normalizePublishedDate(entry)
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
57 self._normalizeUpdatedDate(entry)
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
58 self._normalizeSummary(entry)
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
59
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
60 def _normalizeId(self, entry):
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
61 if not hasattr(entry, "id"):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
62 entry.id = entry.link
197
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
63
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
64 def _normalizePublishedDate(self, entry):
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
65 if not hasattr(entry, "published"):
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
66 if hasattr(entry, "updated"):
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
67 entry.published = entry.updated
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
68
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
69 def _normalizeUpdatedDate(self, entry):
187
2f2016a10f7d handle a missing updated_parsed attribute in a feed entry gracefully
dirk
parents: 167
diff changeset
70 if not hasattr(entry, "updated_parsed") or entry.updated_parsed is None:
213
524cbf9e413c use correct TODO tags so they show up in the tasks view in Eclipse
dirk
parents: 206
diff changeset
71 # TODO: try to parse the entry.updated date string
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
72 entry.updated_parsed = datetime.today()
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
73 else:
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
74 entry.updated_parsed = datetime(*entry.updated_parsed[:6])
197
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
75
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
76 def _normalizeSummary(self, entry):
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
77 if not hasattr(entry, "summary"):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
78 if hasattr(entry, "content"):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
79 entry.summary = entry.content[0].value
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
80 else:
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
81 entry.summary = ""
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
82
206
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
83 def _isExpired(self, entry):
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
84 expireDate = AbstractBackend.calculateExpireDate(self.preferences)
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
85 return entry.updated_parsed < expireDate
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
86
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
87 def _processEntry(self, entry):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
88 raise Exception("_processEntry is abstract, subclasses must override")
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
89
144
74217db92993 updating feeds on the couchdb backend works now
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 141
diff changeset
90 def _incrementFeedUpdateDate(self):
74217db92993 updating feeds on the couchdb backend works now
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 141
diff changeset
91 raise Exception("_incrementFeedUpdateDate is abstract, subclasses must override")
74217db92993 updating feeds on the couchdb backend works now
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 141
diff changeset
92
166
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
93 def _setFeedTitle(self, feedDict):
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
94 if self.feed.title is None:
233
e34c53a3e407 fixes from eric's style check
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 218
diff changeset
95 if 'title' in feedDict.feed:
166
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
96 self.feed.title = feedDict.feed.title
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
97 else:
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
98 self.feed.title = self.feed.rss_url
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
99
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
100
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
101 class FeedUpdateException(Exception):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
102 pass
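
The class docstring says that subclasses need to implement creating and storing the new feed entries. The following is a minimal sketch of that contract, not code from this repository. `SketchFeedUpdater` is a hypothetical, reduced stand-in for `AbstractFeedUpdater` that skips the network fetch and takes the expire date directly instead of reading it from preferences. `InMemoryFeedUpdater` and its `stored` and `last_update` attributes are likewise assumptions made for illustration. Entries are plain dicts here, whereas feedparser returns attribute-style objects.

```python
import time
from datetime import datetime


class SketchFeedUpdater(object):
    """Hypothetical stand-in for AbstractFeedUpdater without any network access."""

    def __init__(self, expire_date):
        # The real class derives this from preferences; here it is passed in.
        self.expire_date = expire_date

    def process_entries(self, entries):
        # Same shape as _processEntries: normalize, skip expired, delegate.
        for entry in entries:
            self._normalize_updated_date(entry)
            if not self._is_expired(entry):
                self._processEntry(entry)
        self._incrementFeedUpdateDate()

    def _normalize_updated_date(self, entry):
        parsed = entry.get("updated_parsed")
        if parsed is None:
            # No usable timestamp: fall back to the current time.
            entry["updated_parsed"] = datetime.today()
        else:
            # A struct_time starts with year, month, day, hour, minute, second.
            entry["updated_parsed"] = datetime(*parsed[:6])

    def _is_expired(self, entry):
        return entry["updated_parsed"] < self.expire_date

    def _processEntry(self, entry):
        raise NotImplementedError("_processEntry is abstract")

    def _incrementFeedUpdateDate(self):
        raise NotImplementedError("_incrementFeedUpdateDate is abstract")


class InMemoryFeedUpdater(SketchFeedUpdater):
    """Hypothetical subclass that keeps accepted entries in a list."""

    def __init__(self, expire_date):
        super(InMemoryFeedUpdater, self).__init__(expire_date)
        self.stored = []
        self.last_update = None

    def _processEntry(self, entry):
        self.stored.append(entry)

    def _incrementFeedUpdateDate(self):
        self.last_update = datetime.today()


updater = InMemoryFeedUpdater(datetime(2019, 1, 1))
updater.process_entries([
    {"id": "old", "updated_parsed": time.strptime("2018-06-01", "%Y-%m-%d")},
    {"id": "new", "updated_parsed": time.strptime("2019-03-01", "%Y-%m-%d")},
    {"id": "undated"},
])
stored_ids = [entry["id"] for entry in updater.stored]
```

In this sketch the entry dated before the expire date is dropped, while the entry without a timestamp is treated as current and kept, which mirrors the fallback in `_normalizeUpdatedDate`.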