* ci: add Dockerfile and action to build celery worker image
* ci: build celery worker on push to jennifer/celery branch
* ci: also build celery worker for main branch
* ci: add comment to celery Dockerfile
* chore: first stab at a celery/rabbitmq docker-compose
* feat: add celery configuration and test task / endpoint
* chore: run mq/celery containers for dev work
* chore: point to ghcr.io image for celery worker
* refactor: move XML parsing duties into XMLDraft

  Move some PlaintextDraft methods into the Draft base class and implement
  them for the XMLDraft class. Use xml2rfc code from ietf.submit as a model
  for the parsing. This leaves some mismatch between the PlaintextDraft and
  the Draft class spec for the get_author_list() method to be resolved.

* feat: add api_upload endpoint and beginnings of async processing

  This adds an api_upload() that behaves analogously to the api_submit()
  endpoint. Celery tasks to handle asynchronous processing are added but
  are not yet functional enough to be useful.

* perf: index Submission table on submission_date

  This substantially speeds up submission rate threshold checks.

* feat: remove existing files when accepting a new submission

  After checking that a submission is not in progress, remove any files in
  staging that have the same name/rev with any extension. This should guard
  against stale files confusing the submission process if the usual cleanup
  fails or is skipped for some reason.

* refactor: make clear that deduce_group() uses only the draft name
* refactor: extract only draft name/revision in clean() method

  This minimizes the amount of validation done when accepting a file. The
  data extraction will be moved to asynchronous processing.

* refactor: minimize checks and data extraction in api_upload() view
* ci: fix dockerfiles to match sandbox testing
* ci: tweak celery container docker-compose settings
* refactor: clean up Draft parsing API and usage

  - remove get_draftname() from Draft API; set filename during init
  - further XMLDraft work:
    - remember xml_version after parsing
    - extract filename/revision during init
    - comment out long-broken get_abstract() method
  - adjust form clean() method to use changed API

* feat: flesh out async submission processing

  First basically working pass!

* feat: add state name for submission being validated asynchronously
* feat: cancel submissions that async processing can't handle
* refactor: simplify/consolidate async tasks and improve error handling
* feat: add api_submission_status endpoint
* refactor: return JSON from submission api endpoints
* refactor: reuse cancel_submission method
* refactor: clean up error reporting a bit
* feat: guard against cancellation of a submission while validating

  Not bulletproof but should prevent

* feat: indicate that a submission is still being validated
* fix: do not delete submission files after creating them
* chore: remove debug statement
* test: add tests of the api_upload and api_submission_status endpoints
* test: add tests and stubs for async side of submission handling
* fix: gracefully handle (ignore) invalid IDs in async submit task
* test: test process_uploaded_submission method
* fix: fix failures of new tests
* refactor: fix type checker complaints
* test: test submission_status view of submission in "validating" state
* fix: fix up migrations
* fix: use the streamlined SubmissionBaseUploadForm for api_upload
* feat: show submission history event timestamp as mouse-over text
* fix: remove 'manual' as next state for 'validating' submission state
* refactor: share SubmissionBaseUploadForm code with Deprecated version
* fix: validate text submission title, update a couple comments
* chore: disable requirements updating when celery dev container starts
* feat: log traceback on unexpected error during submission processing
* feat: allow secretariat to cancel "validating" submission
* feat: indicate time since submission on the status page
* perf: check submission rate thresholds earlier when possible

  No sense parsing details of a draft that is going to be dropped
  regardless of those details!

* fix: create Submission before saving to reduce race condition window
* fix: call deduce_group() with filename
* refactor: remove code lint
* refactor: change the api_upload URL to api/submission
* docs: update submission API documentation
* test: add tests of api_submission's text draft consistency checks
* refactor: rename api_upload to api_submission to agree with new URL
* test: test API documentation and submission thresholds
* fix: fix a couple api_submission view renames missed in templates
* chore: use base image + add arm64 support
* ci: try to fix workflow_dispatch for celery worker
* ci: another attempt to fix workflow_dispatch
* ci: build celery image for submit-async branch
* ci: fix typo
* ci: publish celery worker to ghcr.io/painless-security
* ci: install python requirements in celery image
* ci: fix up requirements install on celery image
* chore: remove XML_LIBRARY references that crept back in
* feat: accept 'replaces' field in api_submission
* docs: update api_submission documentation
* fix: remove unused import
* test: test "replaces" validation for submission API
* test: test that "replaces" is set by api_submission
* feat: trap TERM to gracefully stop celery container
* chore: tweak celery/mq settings
* docs: update installation instructions
* ci: adjust paths that trigger celery worker image build
* ci: fix branches/repo names left over from dev
* ci: run manage.py check when initializing celery container

  The driver here is applying the patches. Starting the celery workers also
  invokes the check task, but this should give a clearer failure if
  something is wrong.

* docs: revise INSTALL instructions
* ci: pass filename to pip update in celery container
* docs: update INSTALL to include freezing pip versions

  Will be used to coordinate package versions with the celery container in
  production.

* docs: add explanation of frozen-requirements.txt
* ci: build image for sandbox deployment
* ci: add additional build trigger path
* docs: tweak INSTALL
* fix: change INSTALL process to stop datatracker before running migrations
* chore: use ietf.settings for manage.py check in celery container
* chore: set uid/gid for celery worker
* chore: create user/group in celery container if needed
* chore: tweak docker compose/init so celery container works in dev
* ci: build mq docker image
* fix: move rabbitmq.pid to writeable location
* fix: clear password when CELERY_PASSWORD is empty

  Setting to an empty password is really not a good plan!

* chore: add shutdown debugging option to celery image
* chore: add django-celery-beat package
* chore: run "celery beat" in datatracker-celery image
* chore: fix docker image name
* feat: add task to cancel stale submissions
* test: test the cancel_stale_submissions task
* chore: make f-string with no interpolation a plain string

Co-authored-by: Nicolas Giard <github@ngpixel.com>
Co-authored-by: Robert Sparks <rjsparks@nostrum.com>
# Copyright The IETF Trust 2016-2020, All Rights Reserved
# -*- coding: utf-8 -*-


import bleach  # type: ignore
import copy
import email.header  # a plain "import email" does not reliably expose email.header
import re
import textwrap
import tlds
import unicodedata

from django.core.validators import URLValidator
from django.core.exceptions import ValidationError
from django.utils.functional import keep_lazy
from django.utils.safestring import mark_safe

import debug  # pyflakes:ignore

from .texescape import init as texescape_init, tex_escape_map

tlds_sorted = sorted(tlds.tld_set, key=len, reverse=True)

protocols = copy.copy(bleach.sanitizer.ALLOWED_PROTOCOLS)
protocols.append("ftp")   # we still have some ftp links
protocols.append("xmpp")  # we still have some xmpp links

tags = set(copy.copy(bleach.sanitizer.ALLOWED_TAGS)).union(
    {
        # fmt: off
        'a', 'abbr', 'acronym', 'address', 'b', 'big',
        'blockquote', 'body', 'br', 'caption', 'center', 'cite', 'code', 'col',
        'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em', 'font',
        'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'head', 'hr', 'html', 'i', 'ins', 'kbd',
        'li', 'ol', 'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike', 'style',
        'strong', 'sub', 'sup', 'table', 'title', 'tbody', 'td', 'tfoot', 'th', 'thead',
        'tr', 'tt', 'u', 'ul', 'var'
        # fmt: on
    }
)

attributes = copy.copy(bleach.sanitizer.ALLOWED_ATTRIBUTES)
attributes["*"] = ["id"]
attributes["ol"] = ["start"]

bleach_cleaner = bleach.sanitizer.Cleaner(
    tags=tags, attributes=attributes, protocols=protocols, strip=True
)

validate_url = URLValidator()


def check_url_validity(attrs, new=False):
    if (None, "href") not in attrs:
        return None
    url = attrs[(None, "href")]
    try:
        if url.startswith("http"):
            validate_url(url)
    except ValidationError:
        return None
    return attrs


bleach_linker = bleach.Linker(
    callbacks=[check_url_validity],
    url_re=bleach.linkifier.build_url_re(tlds=tlds_sorted, protocols=protocols),
    email_re=bleach.linkifier.build_email_re(tlds=tlds_sorted),  # type: ignore
    parse_email=True,
)
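
# Illustrative usage of the cleaner/linker pair configured above (the input
# string here is a made-up example, not part of the original module):
#
#     html = '<script>alert(1)</script><p>See https://example.com</p>'
#     safe = bleach_cleaner.clean(html)      # strips the <script> element
#     linked = bleach_linker.linkify(safe)   # wraps the URL in an <a> tag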


@keep_lazy(str)
def xslugify(value):
    """
    Converts to ASCII. Converts spaces to hyphens. Removes characters that
    aren't alphanumerics, underscores, slashes, or hyphens. Converts to
    lowercase. Also strips leading and trailing whitespace.
    (I.e., does the same as slugify, but also converts slashes to dashes.)
    """
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s/-]', '', value).strip().lower()
    return mark_safe(re.sub(r'[-\s/]+', '-', value))
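
# Example (illustrative): slashes become hyphens, unlike plain slugify.
#
#     xslugify("Area Director/IESG Review")  # -> 'area-director-iesg-review'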


def strip_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    else:
        return text


def strip_suffix(text, suffix):
    if text.endswith(suffix):
        return text[:-len(suffix)]
    else:
        return text


def fill(text, width):
    """Wraps each paragraph in text (a string) so every line
    is at most width characters long, and returns a single string
    containing the wrapped paragraphs.
    """
    width = int(width)
    paras = text.replace("\r\n", "\n").replace("\r", "\n").split("\n\n")
    wrapped = []
    for para in paras:
        if para:
            lines = para.split("\n")
            maxlen = max([len(line) for line in lines])
            if maxlen > width:
                para = textwrap.fill(para, width, replace_whitespace=False)
            wrapped.append(para)
    return "\n\n".join(wrapped)


def wordwrap(text, width=80):
    """Wraps long lines without losing the formatting and indentation
    of short lines"""
    if not isinstance(text, str):
        return text

    def block_separator(s):
        "Look for lines of identical symbols, at least three long"
        ss = s.strip()
        chars = set(ss)
        return len(chars) == 1 and len(ss) >= 3 and ss[0] in set('#*+-.=_~')

    width = int(width)  # ensure we have an int, if this is used as a template filter
    text = re.sub(" *\r\n", "\n", text)  # get rid of DOS line endings
    text = re.sub(" *\r", "\n", text)  # get rid of MAC line endings
    text = re.sub("( *\n){3,}", "\n\n", text)  # get rid of excessive vertical whitespace
    lines = text.split("\n")
    filled = []
    wrapped = False
    prev_indent = None
    for line in lines:
        line = line.expandtabs().rstrip()
        indent = " " * (len(line) - len(line.lstrip()))
        ind = len(indent)
        if wrapped and line.strip() != "" and indent == prev_indent and not block_separator(line):
            line = filled[-1] + " " + line.lstrip()
            filled = filled[:-1]
        else:
            wrapped = False
        while (len(line) > width) and (" " in line[ind:]):
            linelength = len(line)
            wrapped = True
            breakpoint = line.rfind(" ", ind, width)
            if breakpoint == -1:
                breakpoint = line.find(" ", ind)
            filled += [line[:breakpoint]]
            line = indent + line[breakpoint+1:]
            if len(line) >= linelength:
                break
        filled += [line.rstrip()]
        prev_indent = indent
    return "\n".join(filled)


# def alternative_wrap(text, width=80):
#     # From http://blog.belgoat.com/python-textwrap-wrap-your-text-to-terminal-size/
#     textLines = text.split('\n')
#     wrapped_lines = []
#     # Preserve any indent (after the general indent)
#     for line in textLines:
#         preservedIndent = ''
#         existIndent = re.search(r'^(\W+)', line)
#         # Change the existing wrap indent to the original one
#         if (existIndent):
#             preservedIndent = existIndent.groups()[0]
#         wrapped_lines.append(textwrap.fill(line, width=width, subsequent_indent=preservedIndent))
#     text = '\n'.join(wrapped_lines)
#     return text


def wrap_text_if_unwrapped(text, width=80, max_tolerated_line_length=100):
    text = re.sub(" *\r\n", "\n", text)  # get rid of DOS line endings
    text = re.sub(" *\r", "\n", text)  # get rid of MAC line endings

    width = int(width)  # ensure we have an int, if this is used as a template filter
    max_tolerated_line_length = int(max_tolerated_line_length)

    contains_long_lines = any(" " in l and len(l) > max_tolerated_line_length
                              for l in text.split("\n"))

    if contains_long_lines:
        text = wordwrap(text, width)
    return text
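
# Example (illustrative): text that arrived as one long unwrapped line is
# re-wrapped; text whose lines already fit the tolerated length is not.
#
#     wrap_text_if_unwrapped("word " * 40)            # re-wrapped at 80 columns
#     wrap_text_if_unwrapped("already\nshort\nlines") # returned unchanged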


def isascii(text):
    try:
        text.encode('ascii')
        return True
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False


def maybe_split(text, split=True, pos=5000):
    if split:
        n = text.find("\n", pos)
        if n != -1:  # guard: without it, text[:0] would discard everything
            text = text[:n+1]
    return text


def decode(raw):
    assert isinstance(raw, bytes)
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        # if this fails, don't catch the exception here; let it propagate
        text = raw.decode('latin-1')
    return text


def text_to_dict(t):
    "Converts text with RFC2822-formatted header fields into a dictionary-like object."
    # ensure we're handed a unicode parameter
    assert isinstance(t, str)
    d = {}
    # Return {} for malformed input
    if not len(t.lstrip()) == len(t):
        return {}
    lines = t.splitlines()
    items = []
    # unfold folded lines
    for l in lines:
        if len(l) and l[0].isspace():
            if items:
                items[-1] += l
            else:
                return {}
        else:
            items.append(l)
    for i in items:
        if re.match('^[A-Za-z0-9-]+: ', i):
            k, v = i.split(': ', 1)
            d[k] = v
        else:
            return {}
    return d
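
# Example (illustrative): well-formed header text parses into a dict;
# malformed input yields {}.
#
#     text_to_dict("From: a@example.com\nSubject: Hi")
#     # -> {'From': 'a@example.com', 'Subject': 'Hi'}
#     text_to_dict("no header here")
#     # -> {}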


def dict_to_text(d):
    "Convert a dictionary to RFC2822-formatted text"
    t = ""
    for k, v in d.items():
        t += "%s: %s\n" % (k, v)
    return t


def texescape(s):
    if not tex_escape_map:
        texescape_init()
    t = s.translate(tex_escape_map)
    return t


def unwrap(s):
    return s.replace('\n', ' ')


def normalize_text(s):
    """Normalize various unicode whitespaces to ordinary spaces"""
    return re.sub(r'[\s\n\r\u2028\u2029]+', ' ', s, flags=re.U).strip()


def parse_unicode(text):
    "Decode a string encoded according to RFC2047 into a unicode string."

    decoded_string, charset = email.header.decode_header(text)[0]
    if charset is not None:
        try:
            text = decoded_string.decode(charset)
        except UnicodeDecodeError:
            pass
    else:
        text = decoded_string
    return text
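
# Example (illustrative): decoding an RFC2047-encoded header value.
#
#     parse_unicode('=?utf-8?q?Caf=C3=A9?=')  # -> 'Café'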