Commit graph

52 commits

Author SHA1 Message Date
Henrik Levkowetz a485c74314 Merged in [14880] from rjsparks@nostrum.com:
Added a Draft test suite.
 - Legacy-Id: 14901
Note: SVN reference [14880] has been migrated to Git commit e09a28cad2
2018-03-22 16:34:10 +00:00
Russ Housley 565b10e00e Improve parser for references in Internet-Drafts. Fixes #2360
- Legacy-Id: 14851
2018-03-17 18:25:31 +00:00
Henrik Levkowetz 48fe02d58c Permit tildes in romanization of draft author names when looking for draft authors. Can be used in romanization of Arabic names.
- Legacy-Id: 14256
2017-11-01 11:51:24 +00:00
Henrik Levkowetz 0e00adc5ee Another tweak to the draft author extraction code, to handle some name transliterations using multiple leading grave accents.
- Legacy-Id: 14149
2017-09-21 09:28:18 +00:00
Henrik Levkowetz 2c1438c240 Moved unidecode_name from utils.text to person.name.
Modified UserFactory to use a new locale for each new user, instead of the
same locale for a whole test run.  This (almost) ensures the exercise of
code to deal with non-ascii names, something which would not happen if a
locale with ascii names was chosen at the start of a run.

Modified name.initials() to not use non-word characters as initials.

Modified unidecode_name() to do more normalization, to conform to the
conventions used in internet-drafts.

Added saving of the factory-boy random state in order to be able to re-run
a test suite with the same pseudo-random sequence as in a previous failed
run.

Fixed an issue with email formatting in test_api_submit_ok().

Modified the draft author extraction code to deal better with names with
embedded apostrophes.
 - Legacy-Id: 14141
2017-09-20 15:36:30 +00:00
Henrik Levkowetz aafd6290a6 Added an option to ietf.utils.draft.Draft to pull document name from the source file name.
- Legacy-Id: 14089
2017-08-31 14:48:43 +00:00
Henrik Levkowetz b42f1cbeb5 Replaced the use of unaccent.asciify(), which has similar functionality to unidecode.unidecode(). Changed the draft parser to work exclusively with unicode text, which both makes the removal of unaccent easier, and takes us closer to Py35 compatibility. Adjusted callers of the draft parser to send in unicode.
- Legacy-Id: 13673
2017-06-18 18:23:18 +00:00
Henrik Levkowetz 76628be3fd Merged in ^/branch/iola/author-stats-r13145 from olau@iola.dk, and fixed some tests in code which moved after the latest merge with trunk. The test suite passes, but the migrations are _not_ ready to run, because of numbering conflicts (again due to code changes on trunk since the latest sync).
- Legacy-Id: 13479
2017-05-31 20:59:26 +00:00
Henrik Levkowetz 38bfdb4095 Fixed a bug in the earlier author extraction bugfix.
- Legacy-Id: 13295
2017-05-10 12:21:17 +00:00
Henrik Levkowetz fb70e9a4ff Fixed an issue with the author extraction code.
- Legacy-Id: 13288
2017-05-09 19:19:55 +00:00
Ole Laursen ef4d55f0c9 Apply patch from Henrik Levkowetz to fix some problems of author parse
errors where the affiliation is mistakenly thought to be an extra
author (some of these still remain)
 - Legacy-Id: 13142
2017-03-27 08:33:49 +00:00
Ole Laursen d2e85a3aa3 Apply draft parser patch from Henrik to improve the patch on trunk to
combine paragraphs across page splits - this makes the country part of
the parser find more countries
 - Legacy-Id: 12848
2017-02-15 19:10:59 +00:00
Ole Laursen b2ff10b0f2 Add support for extracting the country line from the author addresses
to the draft parser (incorporating patch from trunk), store the
extracted country instead of trying to turn it into an ISO country
code, add country and continent name models and add initial data for
those, add helper function for cleaning the countries, add author
country and continent charts, move the affiliation models to
stats/models.py, fix a bunch of bugs.
 - Legacy-Id: 12846
2017-02-15 18:43:57 +00:00
Henrik Levkowetz 44ad914fba Tweaked the company name extraction code in class Draft.
- Legacy-Id: 12842
2017-02-15 14:09:54 +00:00
Henrik Levkowetz bb5e5b97ba Another tweak to handle page break paragraph joins better in class Draft.
- Legacy-Id: 12840
2017-02-14 17:41:30 +00:00
Henrik Levkowetz 6158221fa8 Tweaked the author extraction to recognize short lines as paragraph ends, not only lines ending in '.' or ':'
- Legacy-Id: 12837
2017-02-14 14:23:15 +00:00
Ole Laursen aebfe44f9e Add simple detection of formal languages used in draft, partially
based on the code in getauthors by Jari Arkko
 - Legacy-Id: 12657
2017-01-16 16:08:56 +00:00
Ole Laursen 34a9f36534 Add helper for getting word count from draft
- Legacy-Id: 12655
2017-01-16 11:35:48 +00:00
Henrik Levkowetz 887455c1d5 Make sure to not include draft name in the title extracted from draft text.
- Legacy-Id: 12176
2016-10-19 12:18:59 +00:00
Henrik Levkowetz f5ca3a12bc Fixed a bug in the header/footer stripping done before abstract extraction when a draft is submitted.
- Legacy-Id: 10519
2015-11-24 20:01:31 +00:00
Henrik Levkowetz 1bf4356002 Improved regex for the Dr.-Ing. honorific fix.
- Legacy-Id: 8509
2014-10-29 06:53:34 +00:00
Henrik Levkowetz 770f79e601 Added 'Dr.-Ing.' to the recognised honorifics in the author extraction code.
- Legacy-Id: 8508
2014-10-29 06:24:41 +00:00
Henrik Levkowetz 46cb5cbdca Did a number of changes to the author extraction method of class Draft in order to make it able to match up names with double-word family names on the first page (A. Foo Bar) with (familyname, given-name) ordering (Foo Bar Any) in the Authors' Addresses section. Regression tested against 200+ known good author extraction results. A number of stronger restrictions in regular expressions had to be introduced to avoid regression, which is probably all to the good.
- Legacy-Id: 8507
2014-10-28 15:45:47 +00:00
Henrik Levkowetz e3077c6e50 Fixed a bug in the new ISO-date code for draft metadata extraction.
- Legacy-Id: 8502
2014-10-27 17:01:16 +00:00
Henrik Levkowetz 4dddf14be0 Added support for ISO-format dates (or RFC 3339 dates, if you will) to the date parsing done for the submission tool. Also refined the regexes a bit to avoid false matches on for instance things like 'Juniper 2014'.
- Legacy-Id: 8501
2014-10-27 16:51:19 +00:00
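A minimal sketch of the kind of date matching the commit above describes. It accepts ISO / RFC 3339 style dates and requires month names to be whole words, so that 'Juniper 2014' is not read as 'Jun 2014'. The patterns and the parse_date() helper are hypothetical illustrations, not the regexes used in ietf/utils/draft.py.

```python
import re
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

# Hypothetical pattern for ISO-format dates such as 2014-10-27.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

# Full month names or three-letter abbreviations, bounded as whole words,
# followed by whitespace and a four-digit year.
MONTH_ALTERNATIVES = "|".join(f"{m}|{m[:3]}" for m in MONTHS)
MONTH_YEAR = re.compile(rf"\b({MONTH_ALTERNATIVES})\.?\s+(\d{{4}})\b", re.IGNORECASE)


def parse_date(text):
    """Return the first recognized date in text, or None (assumed helper)."""
    m = ISO_DATE.search(text)
    if m:
        year, month, day = (int(g) for g in m.groups())
        return date(year, month, day)  # raises ValueError for impossible dates
    m = MONTH_YEAR.search(text)
    if m:
        prefixes = [name[:3] for name in MONTHS]
        month = prefixes.index(m.group(1).lower()[:3]) + 1
        return date(int(m.group(2)), month, 1)  # day defaults to 1 in this sketch
    return None
```

Because the month alternative must be followed directly by an optional period and whitespace, the 'Jun' in 'Juniper' does not qualify as a month name.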
Henrik Levkowetz 9d5a9c143e Reverted changes in ietf/utils/draft.py which should not have been part of [8499].
- Legacy-Id: 8500
Note: SVN reference [8499] has been migrated to Git commit a8ddac15e2
2014-10-27 16:35:50 +00:00
Henrik Levkowetz a8ddac15e2 Merged in [8498] from rjsparks@nostrum.com:
Reworked logic flow for editing shepherds. Added message to inform the user when the shepherd is not changed. Fixes bug #1508.
- Legacy-Id: 8499
Note: SVN reference [8498] has been migrated to Git commit 055202dee4
2014-10-27 16:01:51 +00:00
Henrik Levkowetz 8c42989d5d Pyflakes cleanup compliant with pyflakes 0.8.1, which seems to find things 0.8.0 didn't find.
- Legacy-Id: 7558
2014-04-01 16:25:18 +00:00
Henrik Levkowetz 49edc7404e Made ietf/utils pyflakes-clean.
- Legacy-Id: 7496
2014-03-16 07:26:03 +00:00
Henrik Levkowetz 258ac770b3 Better handling of draft name extraction when there's no extension given.
- Legacy-Id: 6675
2013-11-06 22:18:51 +00:00
Robert Sparks e309ff92b3 Don't insert references to self.
Move the data filler from a migration to a standalone script
 - Legacy-Id: 6620
2013-11-02 20:59:43 +00:00
Robert Sparks b18249222b Refines Bill Fenner's regex based search through documents for references.
Populates RelatedDocument with relations for references for each type draft Document.
Replaces these reference relationships with updated copies on draft submission.
Note to deployer: This migration takes around 10 minutes to complete on a fast development laptop.
 - Legacy-Id: 6572
2013-10-30 20:51:11 +00:00
Henrik Levkowetz 3020c5f7eb Imported a new version of the draft metadata extraction module, which
calculates page numbers more reliably, doesn't include duplicates in
the list of referenced drafts, and other minor tweaks.
 - Legacy-Id: 6362
2013-10-04 13:50:14 +00:00
Henrik Levkowetz c4015a302b Added variations on the recognized date formats during submitted draft parsing, such that comma need not be followed by whitespace in the formats using comma as a separator between some of the fields. Added extraction of drafts referenced by a document, in addition to RFCs referenced.
- Legacy-Id: 5456
2013-02-24 20:17:22 +00:00
Henrik Levkowetz 4946a3f694 Updated draft submission author extraction module to handle dash-separated double given names.
- Legacy-Id: 5088
2012-12-03 13:17:33 +00:00
Henrik Levkowetz 45585957ef Added support for reverse-order (i.e., Japanese, Chinese, and other) names with uppercase family name in the draft submission author extraction.
- Legacy-Id: 4949
2012-10-23 12:33:21 +00:00
Henrik Levkowetz 7467fa48a5 Tweaked the author extraction code to handle company names in the author list on the first page, when the company names contain a comma, such as for instance 'Foo Bar, Inc'.
- Legacy-Id: 4781
2012-08-22 12:52:32 +00:00
Henrik Levkowetz eb28ac8177 Removed the ValueError exceptions introduced in the previous revision of the draft author extraction code. Fixes issue #858.
- Legacy-Id: 4753
2012-08-06 15:16:53 +00:00
Henrik Levkowetz 0c49999fc9 Updated utils/draft.py and modified the submit app code accordingly.
New features (keep in mind that utils/draft.py can be run standalone
to do extraction of draft author data, too):

  * The handling of author info formatted in columns causes problems
    in the face of an author named for instance A. Author with the
    company 'Al Author and Associates', causing breakage of email
    addresses longer than 'Al Author and'.  Tweaked the recognition
    of column data to require multiple (not only one) space around
    'and'.

  * Added support for extraction of author affiliation.

  * Tweaked the meaning of -t, --timestamp and added --notimestamp; and
    made the default be to emit leading timestamps based on the draft
    file time.

  * Added support for running author extraction on RFCs, by not bailing
    out on not finding a draft name when RFC information is available.

  * Added support for additional date formats and author name formats.

  * Improved creation date extraction -- previously, the first supported
    date format which was recognized on the first page of the draft would
    be used, rather than the first date in a supported format.  This could
    cause errors if the Status of Memo section or Abstract contained a
    date occurring at the start of a line.

  * Tweaked the honorific regex to make things work better for the case
    when the full name in the author's address section includes a first
    name which isn't part of the first-page abbreviated name.  Fixes
    problems with draft-chiappa-lisp-introduction and similar.

  * Added a special case for people who provide their email address as
    'foo&cisco.com' instead of 'foo@cisco.com'.  Bah.

  * Added an alternative, more human-readable key-value-pair attribute
    output mode with a '-a' switch.

  * Tweaked the first-name regex to capture cases where the first name
    is indicated with an alternate first letter: 'Y(J) Stein'.  Fixes
    problems with draft-anavi-tdmoip and similar.
 - Legacy-Id: 4612
2012-07-11 12:51:33 +00:00
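Two of the tweaks listed in the commit above can be sketched as follows: treating 'and' as a column separator only when it has multiple spaces on both sides, and repairing addresses written as 'foo&cisco.com'. The regexes and function names here are assumptions made for illustration, not the code in utils/draft.py.

```python
import re

# Hypothetical column separator: 'and' with two or more spaces on each side,
# so a single-spaced 'and' inside a company name is left alone.
COLUMN_AND = re.compile(r"\s{2,}and\s{2,}")

# Hypothetical pattern: a local part, a stray '&', then a dotted domain.
AMP_EMAIL = re.compile(r"^([\w.+-]+)&([\w-]+(?:\.[\w-]+)+)$")


def split_columns(line):
    """Split a line into columns on a widely spaced 'and' (assumed helper)."""
    return [part.strip() for part in COLUMN_AND.split(line)]


def fix_ampersand_email(token):
    """Rewrite 'user&example.com' as 'user@example.com'; leave other text as is."""
    return AMP_EMAIL.sub(r"\1@\2", token)
```

Requiring a dotted domain after the '&' keeps strings such as 'AT&T' from being rewritten as email addresses.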
Henrik Levkowetz f46f893de9 Don't try to output draft metadata (in standalone mode) for a file if the extraction failed.
- Legacy-Id: 3505
2011-10-25 13:58:55 +00:00
Henrik Levkowetz 43f1d5da93 Improved extraction of draft title during submission. Fixed a problem where the scan for an author's email address was prematurely terminated because another author's affiliation also was part of this author's address information.
- Legacy-Id: 3445
2011-10-13 14:36:26 +00:00
Henrik Levkowetz 7f8eea3b9d * Speeded up things and increased reliability by looking for a
    recognizable author's address section, and not searching for
    author names earlier in the document if found.  Fixes a known
    bad case where the author name occurred in the middle of a draft.

  * Added handling for the case where an author name is followed by 
    parentheses which are not closed on the same line.

  * Some refactoring.
 - Legacy-Id: 3417
2011-09-14 12:31:48 +00:00
Henrik Levkowetz 494b3c77fd Fix a problem with author extraction when a given name is the same as the surname.
- Legacy-Id: 3135
2011-05-23 21:42:34 +00:00
Henrik Levkowetz 101fe5f3dd When extracting meta-information from drafts, it is required that some data reside on the first page. Split unpaginated drafts into chunks so we can adhere better to this.
- Legacy-Id: 3083
2011-05-03 14:10:43 +00:00
Henrik Levkowetz fa16a7b0c1 Change ietf/utils/draft.py to provide an alternative method to get author
information: draft.get_author_info().  This method returns a list of
(full_name, first_name, middle_part, surname, suffix, email), with
middle_part, suffix and email set to None if none was found.
 - Legacy-Id: 2921
2011-03-24 13:25:14 +00:00
Henrik Levkowetz 79a283c3f6 Add the fix for email addresses from [2892].
- Legacy-Id: 2920
Note: SVN reference [2892] has been migrated to Git commit db905d3903
2011-03-24 13:09:01 +00:00
Henrik Levkowetz 0b8bcfa81d Fix a series of issues found during testing. This is the patch provided
to Yaco on 2011-03-19, and committed on branch/yaco/idsubmit as [2896].

   * Extraction of a title which doesn't have the draft name on a separate
     page fails.  See for instance this example:
     http://www.ietf.org/staging/draft-ma-cdni-publisher-use-cases-00.txt
     The regex should maybe be updated to permit but not require a newline
     before the draft filename:
     '(?:\n\s*\n\s*)((.+\n){1,2}(.+\n?))(\s+<?draft-\S+\s*\n)\s*\n'

   * If there are blank lines before the start of the author list on the
     first page, the author extraction will fail.  This sometimes happens
     when there's junk at the start of a draft, see for instance
     http://www.ietf.org/id/draft-ietf-mpls-tp-process-00.txt .

   * Sometimes the Authors' Addresses section lists authors with the same
     workplace address on the same line: "Sam Spade and Joe Smith".  This
     needs a fix in the author extraction code.

   * Sometimes the order of first name, surname is different on the first
     page and in the author list, and sometimes the surname is uppercase
     in one place, but not in the other.  This also needs a fix in the
     author extraction code.

   * The header stripping code had a bug, where multiple blank lines could
     be replaced by a single blank line in the stripped text, which could
     mess up title extraction.

   * Title space normalization should be done also for titles from the
     'unusual title format' code branch of the title extraction code.

   * Company names on the first page are sometimes rendered with different
     case than in the Authors' Addresses section.

   * Some drafts list the draft filename _before_ the title, rather than
     after the title.  Permit this too. Covered in the patch.

   * Spanish names can be shown as either
	<given_name> <fathers_first_surname> <mothers_first_surname>
     or less formally as
	<given_name> <fathers_first_surname>
     If the first form is used in the Authors' Addresses section, but the
     second form (with the given name possibly abbreviated to its first
     letter) is used on the first page, the author extraction will fail.

   * Drafts containing tabs will be caught by idnits during I-D submission,
     but in case the drafts.py module is used independently from idnits,
     convert tabs to spaces in order for the author extraction and other
     methods to work as expected.  Example: recently submitted draft
     draft-bergeron-payload-rtpfec-rs-00.txt.

   * Found a draft with a previously unhandled header/footer format:
     draft-fang-mpls-tp-oam-toolset-01.txt.  Tweak needed for header/footer
     stripping.
 - Legacy-Id: 2919
Note: SVN reference [2896] has been migrated to Git commit 5a34b70e52
2011-03-24 13:05:48 +00:00
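For illustration, the title regex quoted in the commit above can be tried against a small made-up first page. The regex is copied from the commit message, where it is only proposed ("should maybe be updated"), so it may differ from what the module ended up using. The sample text and the extract_title_and_name() helper are assumptions. Tabs are expanded first, as the commit suggests.

```python
import re

# Regex as quoted in the commit message: a blank line, one to three title
# lines, then the draft filename followed by a blank line.
TITLE_RE = re.compile(
    r'(?:\n\s*\n\s*)((.+\n){1,2}(.+\n?))(\s+<?draft-\S+\s*\n)\s*\n'
)


def extract_title_and_name(first_page):
    """Return (title, draft_name) or None. Hypothetical helper for this sketch."""
    first_page = first_page.expandtabs()  # convert tabs to spaces before matching
    m = TITLE_RE.search(first_page)
    if m is None:
        return None
    title = " ".join(m.group(1).split())  # normalize whitespace in the title
    name = m.group(4).strip().lstrip("<").rstrip(">")
    return title, name


# Made-up first page, used only to exercise the regex.
SAMPLE = (
    "Network Working Group                    A. Author\n"
    "Internet-Draft                          Example Inc\n"
    "\n"
    "\n"
    "        An Example Title\n"
    "      draft-example-title-00\n"
    "\n"
    "Abstract\n"
)

result = extract_title_and_name(SAMPLE)
```

The title group can pick up trailing indentation from the filename line through backtracking, which is why the sketch normalizes whitespace before returning the title.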
Henrik Levkowetz 61300a9354 Fix title extraction. Patch provided to yaco 2011-03-14, committed to yaco/idsubmit branch as [2887].
- Legacy-Id: 2918
Note: SVN reference [2887] has been migrated to Git commit fb7219c6ce
2011-03-24 12:58:33 +00:00
Henrik Levkowetz 9ae7b90b59 Merged in changes from Yaco @2880.
- Legacy-Id: 2917
2011-03-24 12:54:31 +00:00
Henrik Levkowetz 265b94c4ca Bugfix for faulty header/footer stripping, sent to Yaco 2011-03-02.
- Legacy-Id: 2916
2011-03-24 12:53:19 +00:00