annotate backend/AbstractFeedUpdater.py @ 237:5bbb1e707e56

add maintainance script
author Dirk Olmes <dirk@xanthippe.ping.de>
date Thu, 23 Apr 2015 07:59:19 +0200
parents e34c53a3e407
children 03e3ebb1d52f
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
217
bb3c851b18b1 add source file endcoding header
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 213
diff changeset
1 # -*- coding: utf-8 -*-
206
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
2 import AbstractBackend
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
3 import feedparser
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
4 import logging
218
699d8f1cebd4 unify imports, especially Qt imports. Use consistent super syntax
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 217
diff changeset
5 from datetime import datetime
699d8f1cebd4 unify imports, especially Qt imports. Use consistent super syntax
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 217
diff changeset
6 from urllib2 import ProxyHandler
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
7
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
8 STATUS_ERROR = 400
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
9 log = logging.getLogger("FeedUpdater")
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
10
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
11 class AbstractFeedUpdater(object):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
12 '''
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
13 Abstract base class for FeedUpdater implementations - handles all the parsing of the feed.
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
14 Subclasses need to implement creating and storing the new feed entries.
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
15 '''
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
16
166
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
17 def __init__(self, preferences):
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
18 self.preferences = preferences
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
19
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
20 def update(self, feed):
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
21 self.feed = feed
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
22 log.info("updating " + feed.rss_url)
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
23 result = self._retrieveFeed()
167
a3c945ce434c adjust the sqlalchemy backend to the changes in AbstractFeedUpdater
dirk
parents: 166
diff changeset
24 self._setFeedTitle(result)
160
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
25 self._processEntries(result)
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
26
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
27 def _retrieveFeed(self):
166
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
28 if self.preferences.isProxyConfigured():
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
29 proxyUrl = "http://%s:%i" % (self.preferences.proxyHost(), self.preferences.proxyPort())
233
e34c53a3e407 fixes from eric's style check
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 218
diff changeset
30 proxyHandler = ProxyHandler({"http": proxyUrl})
166
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
31 result = feedparser.parse(self.feed.rss_url, handlers=[proxyHandler])
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
32 else:
167
a3c945ce434c adjust the sqlalchemy backend to the changes in AbstractFeedUpdater
dirk
parents: 166
diff changeset
33 # when updating to python3 see http://code.google.com/p/feedparser/issues/detail?id=260
166
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
34 result = feedparser.parse(self.feed.rss_url)
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
35 # bozo flags if a feed is well-formed.
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
36 # if result["bozo"] > 0:
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
37 # raise FeedUpdateException()
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
38 status = result["status"]
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
39 if status >= STATUS_ERROR:
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
40 raise FeedUpdateException("HTTP status " + str(status))
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
41 return result
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
42
160
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
43 def _processEntries(self, feedDict):
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
44 for entry in feedDict.entries:
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
45 self._normalize(entry)
206
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
46 if not self._isExpired(entry):
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
47 self._processEntry(entry)
160
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
48 self._incrementFeedUpdateDate()
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
diff changeset
49
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
50 def _normalize(self, entry):
197
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
51 self._normalizeId(entry)
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
52 self._normalizePublishedDate(entry)
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
53 self._normalizeUpdatedDate(entry)
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
54 self._normalizeSummary(entry)
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
55
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
56 def _normalizeId(self, entry):
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
57 if not hasattr(entry, "id"):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
58 entry.id = entry.link
197
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
59
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
60 def _normalizePublishedDate(self, entry):
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
61 if not hasattr(entry, "published"):
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
62 if hasattr(entry, "updated"):
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
63 entry.published = entry.updated
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
64
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
65 def _normalizeUpdatedDate(self, entry):
187
2f2016a10f7d handle a missing updated_parsed attribute in a feed entry gracefully
dirk
parents: 167
diff changeset
66 if not hasattr(entry, "updated_parsed") or entry.updated_parsed is None:
213
524cbf9e413c use correct TODO tags so they show up in the tasks view in Eclipse
dirk
parents: 206
diff changeset
67 # TODO: try to parse the entry.updated date string
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
68 entry.updated_parsed = datetime.today()
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
69 else:
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
70 entry.updated_parsed = datetime(*entry.updated_parsed[:6])
197
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
71
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
diff changeset
72 def _normalizeSummary(self, entry):
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
73 if not hasattr(entry, "summary"):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
74 if hasattr(entry, "content"):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
75 entry.summary = entry.content[0].value
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
76 else:
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
77 entry.summary = ""
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
78
206
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
79 def _isExpired(self, entry):
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
80 expireDate = AbstractBackend.calculateExpireDate(self.preferences)
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
81 return entry.updated_parsed < expireDate
f74fe7cb5091 when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents: 197
diff changeset
82
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
83 def _processEntry(self, entry):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
84 raise Exception("_processEntry is abstract, subclasses must override")
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
85
144
74217db92993 updating feeds on the couchdb backend works now
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 141
diff changeset
86 def _incrementFeedUpdateDate(self):
74217db92993 updating feeds on the couchdb backend works now
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 141
diff changeset
87 raise Exception("_incrementNextUpdateDate is abstract, subclasses must override")
74217db92993 updating feeds on the couchdb backend works now
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 141
diff changeset
88
166
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
89 def _setFeedTitle(self, feedDict):
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
90 if self.feed.title is None:
233
e34c53a3e407 fixes from eric's style check
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 218
diff changeset
91 if 'title' in feedDict.feed:
166
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
92 self.feed.title = feedDict.feed.title
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
93 else:
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
94 self.feed.title = self.feed.rss_url
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
diff changeset
95
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
96
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
97 class FeedUpdateException(Exception):
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff changeset
98 pass