Mercurial > hg > Feedworm
annotate backend/AbstractFeedUpdater.py @ 257:75b81da8d7a5
convert the feed entry timestamps to arango compatible date strings in migration
author | Dirk Olmes <dirk@xanthippe.ping.de> |
---|---|
date | Tue, 12 Mar 2019 02:38:41 +0100 |
parents | 8e73a8ae863f |
children |
rev | line source |
---|---|
217
bb3c851b18b1
add source file endcoding header
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
213
diff
changeset
|
1 # -*- coding: utf-8 -*- |
206
f74fe7cb5091
when updating feeds, only ever create new Feed objects for entries that are younger than the current expire date. This ensures that we do not see old, read, expired entries again
dirk
parents:
197
diff
changeset
|
2 import AbstractBackend |
141
6ea813cfac33
pull out common code for updating a feed into an abstract class, have the sqlalchemy backend use that class.
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
diff
changeset
|
3 import feedparser |
4 import logging |
218
699d8f1cebd4
unify imports, especially Qt imports. Use consistent super syntax
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
217
diff
changeset
|
5 from datetime import datetime |
6 from urllib2 import ProxyHandler |
141
7 |
8 STATUS_ERROR = 400 |
9 log = logging.getLogger("FeedUpdater") |
10 |
245 | 11 """ |
12 Abstract base class for FeedUpdater implementations - handles all the parsing of the feed. | |
13 Subclasses need to implement creating and storing the new feed entries. | |
14 """ | |
141
15 class AbstractFeedUpdater(object): |
16 |
166
04c3b9796b89
feedparser uses the proxy now if one is configured. To implement this the FeedUpdater had to change a bit - sqlalchemy backend is not yet refactored.
dirk
parents:
160
diff
changeset
|
17 def __init__(self, preferences): |
18 self.preferences = preferences |
19 |
20 def update(self, feed): |
21 self.feed = feed |
22 log.info("updating " + feed.rss_url) |
23 result = self._retrieveFeed() |
167
a3c945ce434c
adjust the sqlalchemy backend to the changes in AbstractFeedUpdater
dirk
parents:
166
diff
changeset
|
24 self._setFeedTitle(result) |
160
86f828096aaf
Do not fetch and parse the feed twice when creating a new one. Pass the parsed info into the update method instead to reuse.
dirk
parents:
144
diff
changeset
|
25 self._processEntries(result) |
141
26 |
27 def _retrieveFeed(self): |
244
b46d7fe6390b
re-activate the use of proxy
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
243
diff
changeset
|
28 # when updating to python3 see http://code.google.com/p/feedparser/issues/detail?id=260 |
29 handlers = None |
30 if self.preferences.isProxyConfigured() and self.preferences.useProxy(): |
31 proxyUrl = '{0}:{1}'.format(self.preferences.proxyHost(), self.preferences.proxyPort()) |
32 proxyHandler = ProxyHandler({'http': proxyUrl, 'https': proxyUrl}) |
33 handlers = [proxyHandler] |
34 |
35 result = feedparser.parse(self.feed.rss_url, handlers=handlers) |
36 if result.bozo > 0: |
37 log.warning('result contains bozo') |
38 log.warning(result) |
141
39 # bozo is non-zero when the feed is not well-formed. |
40 # if result["bozo"] > 0: |
41 # raise FeedUpdateException() |
244
42 status = result.status |
141
43 if status >= STATUS_ERROR: |
44 raise FeedUpdateException("HTTP status " + str(status)) |
45 return result |
46 |
160
47 def _processEntries(self, feedDict): |
48 for entry in feedDict.entries: |
49 self._normalize(entry) |
206
50 if not self._isExpired(entry): |
51 self._processEntry(entry) |
160
52 self._incrementFeedUpdateDate() |
53 |
141
54 def _normalize(self, entry): |
197
e604c32f67aa
normalize the published date if the feed contains none
dirk
parents:
187
diff
changeset
|
55 self._normalizeId(entry) |
56 self._normalizePublishedDate(entry) |
57 self._normalizeUpdatedDate(entry) |
58 self._normalizeSummary(entry) |
59 |
60 def _normalizeId(self, entry): |
141
61 if not hasattr(entry, "id"): |
62 entry.id = entry.link |
197
63 |
64 def _normalizePublishedDate(self, entry): |
65 if not hasattr(entry, "published"): |
66 if hasattr(entry, "updated"): |
67 entry.published = entry.updated |
68 |
69 def _normalizeUpdatedDate(self, entry): |
187
2f2016a10f7d
handle a missing updated_parsed attribute in a feed entry gracefully
dirk
parents:
167
diff
changeset
|
70 if not hasattr(entry, "updated_parsed") or entry.updated_parsed is None: |
213
524cbf9e413c
use correct TODO tags so they show up in the tasks view in Eclipse
dirk
parents:
206
diff
changeset
|
71 # TODO: try to parse the entry.updated date string |
141
72 entry.updated_parsed = datetime.today() |
73 else: |
74 entry.updated_parsed = datetime(*entry.updated_parsed[:6]) |
197
75 |
76 def _normalizeSummary(self, entry): |
141
77 if not hasattr(entry, "summary"): |
78 if hasattr(entry, "content"): |
79 entry.summary = entry.content[0].value |
80 else: |
81 entry.summary = "" |
82 |
206
83 def _isExpired(self, entry): |
84 expireDate = AbstractBackend.calculateExpireDate(self.preferences) |
85 return entry.updated_parsed < expireDate |
86 |
141
87 def _processEntry(self, entry): |
88 raise Exception("_processEntry is abstract, subclasses must override") |
89 |
144
74217db92993
updating feeds on the couchdb backend works now
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
141
diff
changeset
|
90 def _incrementFeedUpdateDate(self): |
91 raise Exception("_incrementFeedUpdateDate is abstract, subclasses must override") |
92 |
166
93 def _setFeedTitle(self, feedDict): |
94 if self.feed.title is None: |
233
e34c53a3e407
fixes from eric's style check
Dirk Olmes <dirk@xanthippe.ping.de>
parents:
218
diff
changeset
|
95 if 'title' in feedDict.feed: |
166
96 self.feed.title = feedDict.feed.title |
97 else: |
98 self.feed.title = self.feed.rss_url |
99 |
141
100 |
101 class FeedUpdateException(Exception): |
102 pass |
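
The date handling in `_normalizeUpdatedDate` and `_isExpired` (source lines 69 to 85) is easy to misread in the annotated view, so here is a minimal standalone sketch of the same idea. It is not part of the repository: the function names `normalize_updated` and `is_expired` are hypothetical, the seven-day window is an assumed stand-in for whatever `AbstractBackend.calculateExpireDate(preferences)` returns, and `time.strptime` is used only to produce a `time.struct_time` like the one feedparser puts in `updated_parsed`.

```python
import time
from datetime import datetime, timedelta


def normalize_updated(updated_parsed, now=None):
    """Turn a time.struct_time into a datetime.

    Falls back to the current time (or the supplied `now`) when the feed
    provided no usable date, as the original method does with datetime.today().
    """
    if updated_parsed is None:
        return now or datetime.today()
    # The first six fields of a struct_time are
    # year, month, day, hour, minute, second.
    return datetime(*updated_parsed[:6])


def is_expired(updated, expire_date):
    """Entries last updated before the expire date are skipped."""
    return updated < expire_date


# Assumed values for illustration only: a fixed "now" and a seven-day window.
now = datetime(2019, 3, 12, 2, 38, 41)
expire_date = now - timedelta(days=7)

# An entry with an old timestamp, and one without any timestamp at all.
old_struct = time.strptime("2019-01-01 10:00:00", "%Y-%m-%d %H:%M:%S")
stale = normalize_updated(old_struct)
fresh = normalize_updated(None, now=now)

print(stale, is_expired(stale, expire_date))
print(fresh, is_expired(fresh, expire_date))
```

An entry without a parsed date is stamped with the current time, so it is never treated as expired on the run that first sees it; the TODO on source line 71 suggests the author intended to parse `entry.updated` instead at some point.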