# -*- coding: utf-8 -*-
from setuptools import setup

modules = ['itemsubjector']
install_requires = [
    'console-menu>=0.7.1,<0.8.0',
    'pandas>=1.5.0,<2.0.0',
    'pydantic>=1.10.2,<2.0.0',
    'rich>=12.5.1,<13.0.0',
    'setuptools>=65.4.1,<66.0.0',
    'wikibaseintegrator>=0.12.1,<0.13.0',
]

setup_kwargs = {
    'name': 'itemsubjector',
    'version': '0.3.1',
    'description': 'CLI-tool to easily add "main subject" aka topics in bulk to groups of items on Wikidata',
    'long_description': '# ItemSubjector\nThe purpose of this command-line tool is to add main subject statements to Wikidata \nitems based on a heuristic matching the subject with the title of the item. \n![image](https://user-images.githubusercontent.com/68460690/133230724-40a610b7-5557-4b2b-b66e-2d80ca89e90d.png)\n*The tool running in PAWS adding manually found main subject QIDs*\n![image](https://user-images.githubusercontent.com/68460690/155840858-057292a5-8647-415f-8df3-7bbb90884dbc.png)\n*ItemSubjector running in GNU Screen on a Toolforge bastion with --limit 100000 and \n--sparql matching the WHO list of essential medicines.*\n\n# Background\nAs of September 2021 there were 37M scientific articles in Wikidata, but 27M of them were missing any main \nsubject statement. That makes them very hard for scientists to find, which is bad for science \nbecause building on the work of others is essential in the global scientific community.\n\nTo my knowledge, none of the scientific search engines that are currently used in the scientific community rely on an\nopen graph editable by anyone and maintained by the community itself for the purpose of helping fellow\nscientists find each other\'s work. Wikipedia and Scholia can fill that gap, but we need good tooling to curate the \nmillions of items.\n\n# Features\nThis tool has the following features:\n* Adding a list of manually supplied main subjects to a few selected subgraphs \n  (these currently include a total of 37M items, with scholarly items being the biggest subgraph by far).\n* Matching against a set of items fetched via a SPARQL query.\n* Matching up to a limit of items, which together with Kubernetes makes it possible to start a query that \ncollects jobs with items until the limit is reached and then asks for approval/decline of each job. 
This \nenables the user to create large batches of jobs with 100k+ items in total in a matter of minutes.\n* Batch mode that can be used together with the above features and be run non-interactively, \n  e.g. in the Wikimedia Cloud Services Kubernetes Beta cluster\n\nIt supports \n[Wikidata:Edit groups](https://www.wikidata.org/wiki/Wikidata:Edit_groups) \nso that batches can easily be undone later if needed. \nClick "details" in the summary of edits to see more.\n\n# Installation\nDownload the release tarball or clone the tool using Git.\n\n## Clone the repository \n`git clone https://github.com/dpriskorn/ItemSubjector.git && cd ItemSubjector`\n\nThen check out the latest release. \n\n`git checkout v0.x` where x is the latest number on the release page.\n\n## Set up the environment\n\nMake a virtual environment and set it up using Poetry. If you don\'t have Poetry installed, run:\n`$ pip install poetry`\n\nand then set up everything with\n\n`$ poetry install`\n\nto install all requirements in a virtual environment.\n\n## PAWS\n*Note: PAWS is not ideal for batch jobs unless you \nare willing to keep your browser tab open for the \nwhole duration of the job. Consider using Kubernetes \ninstead, see below*\n\nThe tool runs in PAWS with no known \nissues.\n* log in to PAWS\n* open a terminal\n* run `git clone https://github.com/dpriskorn/ItemSubjector.git .itemsubjector && cd .itemsubjector && pip install -r requirements.txt` \n  <- note the dot in front of the directory name, \n  which hides it from publication. This is crucial to \n  avoid publishing your login credentials.\n* follow the details under Setup below\n\n\n## Wikimedia Cloud Services Kubernetes Beta cluster\nSee [Kubernetes_HOWTO.md](Kubernetes_HOWTO.md)\n\n# Setup\nSet up the config by copying config/config.example.py -> \nconfig/__init__.py and enter the bot username \n(e.g. 
So9q@itemsubjector) and password \n(first [create a bot password](https://www.wikidata.org/wiki/Special:BotPasswords) \nfor your account \nand make sure you give it the *edit page permission* \nand *high volume permissions*)\n* e.g. `cp config/config.example.py config/__init__.py && nano config/__init__.py`\n\n*GNU Nano is an editor, press `ctrl+x` when you are done and `y` to save your changes*\n\n# Use\nThis tool helps by adding the \nvalidated or supplied QID to all \nscientific articles where the \nsearch string appears (with \nspaces around it or at the beginning\nor end of the string) in the label \nof the target item (e.g. scientific article).\n\n## Adding QIDs manually\n*Always provide the most precise subjects first*\n\nRun the script with the -a or --add argument \nfollowed by one or more QIDs or URLs:\n* `python itemsubjector.py -a Q108528107` or\n* `python itemsubjector.py -a https://www.wikidata.org/wiki/Q108528107`\n\n*Note: since v0.2 you should not add subjects that are subclasses \nof each other in one go. \nThis is because of internal changes related to job handling*\n\nAdd the narrower subject first and then the broader one like this:\n* `python itemsubjector.py -a narrow-QID && python itemsubjector.py -a broader-QID`\n\nPlease investigate before adding broad \nsubjects (with thousands of matches) \nand try to nail down specific \nsubjects and add them first. If you are \nunsure, please ask on-wiki or in the \n[Wikicite Telegram group](https://meta.wikimedia.org/wiki/Telegram)\n\n### Disable alias matching\nSometimes, e.g. for main subjects like \n[Sweden](https://www.wikidata.org/wiki/Q34), \nit is necessary to disable alias matching to \navoid garbage matches. 
\n\nUsage example:\n`python itemsubjector.py -a Q34 --no-aliases` \n(the shorthand `-na` also works)\n\n### Disable search expression confirmation\nAvoid the extra question "Do you want to continue?":\n\nUsage example:\n`python itemsubjector.py -a Q34 --no-confirmation` \n(the shorthand `-nc` also works)\n\n### Show links column in table of search expressions \nThis is handy if you want to look them up easily.\n\nUsage example:\n`python itemsubjector.py -a Q34 --show-search-urls` \n(the shorthand `-su` also works)\n\n### Show links column in table of items \nThis is handy if you want to look them up easily.\n\nUsage example:\n`python itemsubjector.py -a Q34 --show-item-urls` \n(the shorthand `-iu` also works)\n\n### Limit to scholarly articles without main subject\nUsage example:\n`python itemsubjector.py -a Q34 --limit-to-items-without-p921` \n(the shorthand `-w` also works)\n\n## Matching main subjects based on a SPARQL query\nThe tool can create a list of jobs by picking random subjects from a\nuser\'s SPARQL query.\n\nUsage example for diseases:\n`python itemsubjector.py -iu --sparql "SELECT ?item WHERE {?item wdt:P31 wd:Q12136.}"`\n\nThis makes it much easier to cover a range of subjects. \nThis example query returns ~5000 items to match :)\n\n## Batch job features\nThe tool can help prepare jobs and then run \nthem later non-interactively. 
This enables the user\nto submit them as jobs on the Wikimedia Cloud Service \nBeta Kubernetes cluster, so you don\'t \nhave to run them locally if you don\'t want to.\n\nSee the commands below and \nhttps://phabricator.wikimedia.org/T285944#7373913 \nfor details.\n\n*Note: if you quit/stop a list of jobs that are \ncurrently running, please remove the \nunfinished prepared jobs before preparing \nnew jobs by running --remove-prepared-jobs*\n\n## List of all options\nThis is the output of `itemsubjector.py -h`:\n```buildoutcfg\nusage: itemsubjector.py [-h] [-a ADD [ADD ...]] [-na] [-nc] [-p] [-r] [-rm] [-m] [-w] [-su] [-iu] [--sparql [SPARQL]] [--debug-sparql]\n                        [--no-ask-match-more-limit [NO_ASK_MATCH_MORE_LIMIT]] [--export-jobs-to-dataframe]\n\nItemSubjector enables working main subject statements on items based on a\nheuristic matching the subject with the title of the item.\n\nExample adding one QID:\n\'$ itemsubjector.py -a Q1234\'\n\nExample adding one QID and prepare a job list to be run non-interactively later:\n\'$ itemsubjector.py -a Q1234 -p\'\n\nExample working on all diseases:\n\'$ itemsubjector.py --sparql "SELECT ?item WHERE {?item wdt:P31 wd:Q12136. MINUS {?item wdt:P1889 [].}}"\'\n\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -a ADD [ADD ...], --add ADD [ADD ...], --qid-to-add ADD [ADD ...]\n                        List of QIDs or URLs to Q-items that are to be added as main subjects on scientific articles. Always add the most specific ones first. See the\n                        README for examples\n  -na, --no-aliases     Turn off alias matching\n  -nc, --no-confirmation\n                        Turn off confirmation after displaying the search expressions, before running the queries.\n  -p, --prepare-jobs    Prepare a job for later execution, e.g. 
in a job engine\n  -r, --run-prepared-jobs\n                        Run prepared jobs non-interactively\n  -rm, --remove-prepared-jobs\n                        Remove prepared jobs\n  -m, --match-existing-main-subjects\n                        Match from list of 136.000 already used main subjects on other scientific articles\n  -w, --limit-to-items-without-p921\n                        Limit matching to scientific articles without P921 main subject\n  -su, --show-search-urls\n                        Show an extra column in the table of search strings with links\n  -iu, --show-item-urls\n                        Show an extra column in the table of items with links\n  --sparql [SPARQL]     Work on main subject items returned by this SPARQL query. Note: "?item" has to be selected for it to work, see the example above. Note: MINUS {?item\n                        wdt:P1889 [].} must be present in the query to avoid false positives.\n  --debug-sparql        Enable debugging of SPARQL queries.\n  --no-ask-match-more-limit [NO_ASK_MATCH_MORE_LIMIT], --limit [NO_ASK_MATCH_MORE_LIMIT]\n                        When working on SPARQL queries of e.g. galaxies, match more until this many matches are in the job list\n  --export-jobs-to-dataframe\n                        Export the prepared job list to a Pandas DataFrame.\n```\n# What I learned\n* I used the black code-formatter for the first time in this project and \nit is a pleasure to not have to sit and manually format the code anymore.\n  \n* I used argparse for the first time in this project and how to type it \n  properly.\n  \n* This was one of the first of my projects that had scope creep. 
I have \nremoved the QuickStatements export to simplify the program.\n  \n* This project has been used in a scientific paper I wrote together with \n[Houcemeddine Turki](https://scholia.toolforge.org/author/Q53505397)\n\n# Thanks\nDuring the development of this tool the author got \nhelp multiple times from **Jan Ainali** and **Jon Søby**\nwith figuring out how to query the API using the \nCirrusSearch extensions and how to remove more \nspecific main subjects from the query results.\n\nA special thanks also to **Magnus Sälgö** and **Arthur Smith** for their\nvaluable input and ideas, e.g. to also search for aliases, and to *Jean* and the \nincredibly\nhelpful people in the Wikimedia Cloud Services Support chat who\nhelped with making batch jobs run successfully.\n\nThanks also to **jsamwrites** for help with testing and suggestions \nfor improvement.\n\n# License\nGPLv3+\n\n',
    'author': 'Dennis Priskorn',
    'author_email': '68460690+dpriskorn@users.noreply.github.com',
    'url': 'https://github.com/dpriskorn/ItemSubjector',
    'py_modules': modules,
    'install_requires': install_requires,
    'python_requires': '>=3.8,<3.11',
}


setup(**setup_kwargs)
