annotate backend/AbstractFeedUpdater.py @ 205:adf7f617bda9

make the name of the design document configurable via command line switch. When cloning the feedworm db, the design document is no longer the same as the database name
author dirk
date Sat, 02 Jun 2012 04:24:49 +0200
parents e604c32f67aa
children f74fe7cb5091
rev   line source
141
6ea813cfac33 pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
1
2 from datetime import datetime
3 import feedparser
4 import logging
166
04c3b9796b89 feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents: 160
5 from urllib2 import ProxyHandler
141
6
7 STATUS_ERROR = 400
8 log = logging.getLogger("FeedUpdater")
9
10 class AbstractFeedUpdater(object):
11     '''
12     Abstract base class for FeedUpdater implementations - handles all the parsing of the feed.
13     Subclasses need to implement creating and storing the new feed entries.
14     '''
15
166
16     def __init__(self, preferences):
17         self.preferences = preferences
18
19     def update(self, feed):
20         self.feed = feed
21         log.info("updating " + feed.rss_url)
22         result = self._retrieveFeed()
167
a3c945ce434c adjust the sqlalchemy backend to the changes in AbstractFeedUpdater
dirk
parents: 166
23         self._setFeedTitle(result)
160
86f828096aaf Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents: 144
24         self._processEntries(result)
141
25
26     def _retrieveFeed(self):
166
27         if self.preferences.isProxyConfigured():
28             proxyUrl = "http://%s:%i" % (self.preferences.proxyHost(), self.preferences.proxyPort())
29             proxyHandler = ProxyHandler({"http" : proxyUrl})
30             result = feedparser.parse(self.feed.rss_url, handlers=[proxyHandler])
31         else:
167
32             # when updating to python3 see http://code.google.com/p/feedparser/issues/detail?id=260
166
33             result = feedparser.parse(self.feed.rss_url)
141
34         # bozo is set when a feed is not well-formed.
35         # if result["bozo"] > 0:
36         #     raise FeedUpdateException()
37         status = result["status"]
38         if status >= STATUS_ERROR:
39             raise FeedUpdateException("HTTP status " + str(status))
40         return result
41
160
42     def _processEntries(self, feedDict):
43         for entry in feedDict.entries:
44             self._normalize(entry)
45             self._processEntry(entry)
46         self._incrementFeedUpdateDate()
47
141
48     def _normalize(self, entry):
197
e604c32f67aa normalize the published date if the feed contains none
dirk
parents: 187
49         self._normalizeId(entry)
50         self._normalizePublishedDate(entry)
51         self._normalizeUpdatedDate(entry)
52         self._normalizeSummary(entry)
53
54     def _normalizeId(self, entry):
141
55         if not hasattr(entry, "id"):
56             entry.id = entry.link
197
57
58     def _normalizePublishedDate(self, entry):
59         if not hasattr(entry, "published"):
60             if hasattr(entry, "updated"):
61                 entry.published = entry.updated
62
63     def _normalizeUpdatedDate(self, entry):
187
2f2016a10f7d handle a missing updated_parsed attribute in a feed entry gracefully
dirk
parents: 167
64         if not hasattr(entry, "updated_parsed") or entry.updated_parsed is None:
65             # TODO try to parse the entry.updated date string
141
66             entry.updated_parsed = datetime.today()
67         else:
68             entry.updated_parsed = datetime(*entry.updated_parsed[:6])
197
69
70     def _normalizeSummary(self, entry):
141
71         if not hasattr(entry, "summary"):
72             if hasattr(entry, "content"):
73                 entry.summary = entry.content[0].value
74             else:
75                 entry.summary = ""
76
77     def _processEntry(self, entry):
78         raise NotImplementedError("_processEntry is abstract, subclasses must override")
79
144
74217db92993 updating feeds on the couchdb backend works now
Dirk Olmes <dirk@xanthippe.ping.de>
parents: 141
80     def _incrementFeedUpdateDate(self):
81         raise NotImplementedError("_incrementFeedUpdateDate is abstract, subclasses must override")
82
166
83     def _setFeedTitle(self, feedDict):
84         if self.feed.title is None:
85             if "title" in feedDict.feed:
86                 self.feed.title = feedDict.feed.title
87             else:
88                 self.feed.title = self.feed.rss_url
89
141
90
91 class FeedUpdateException(Exception):
92     pass
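
The class above leaves `_processEntry` and `_incrementFeedUpdateDate` to its subclasses. Below is a minimal sketch of how a concrete backend might fill them in, assuming an in-memory store. `SketchFeedUpdater`, `InMemoryFeedUpdater`, the `last_update` attribute and the example URLs are hypothetical and not part of the repository. The sketch is written for Python 3, skips the feedparser fetch and the `updated_parsed` conversion, and takes an already parsed result instead.

```python
from datetime import datetime
from types import SimpleNamespace


class SketchFeedUpdater(object):
    """Hypothetical stand-in mirroring the update flow of AbstractFeedUpdater."""

    def update(self, feed, parsed):
        # Unlike the listing, the parsed result is passed in rather than fetched.
        self.feed = feed
        for entry in parsed.entries:
            self._normalize(entry)
            self._processEntry(entry)
        self._incrementFeedUpdateDate()

    def _normalize(self, entry):
        # Same fallbacks as the listing: id <- link, published <- updated,
        # summary <- first content value or the empty string.
        if not hasattr(entry, "id"):
            entry.id = entry.link
        if not hasattr(entry, "published") and hasattr(entry, "updated"):
            entry.published = entry.updated
        if not hasattr(entry, "summary"):
            if hasattr(entry, "content"):
                entry.summary = entry.content[0].value
            else:
                entry.summary = ""

    def _processEntry(self, entry):
        raise NotImplementedError("_processEntry is abstract, subclasses must override")

    def _incrementFeedUpdateDate(self):
        raise NotImplementedError("_incrementFeedUpdateDate is abstract, subclasses must override")


class InMemoryFeedUpdater(SketchFeedUpdater):
    """Hypothetical backend that keeps entries in a dict keyed by entry id."""

    def __init__(self):
        self.entries = {}

    def _processEntry(self, entry):
        # Store each entry once; a repeated id is treated as already known.
        self.entries.setdefault(entry.id, entry)

    def _incrementFeedUpdateDate(self):
        # Record when the feed was last processed (assumed attribute name).
        self.feed.last_update = datetime.now()


# Usage: two entries sharing a link collapse into one stored entry.
feed = SimpleNamespace(rss_url="http://example.org/feed", title=None)
parsed = SimpleNamespace(entries=[
    SimpleNamespace(link="http://example.org/1", updated="Sat, 02 Jun 2012"),
    SimpleNamespace(link="http://example.org/1", updated="Sat, 02 Jun 2012"),
])
updater = InMemoryFeedUpdater()
updater.update(feed, parsed)
```

Keying the store by the normalized `id` is what makes a repeated update harmless here; a real backend would presumably do the equivalent lookup against its database.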