Feed is not overwritten when custom extension is used
Description
I’m trying to export scrapy crawl results in JSON Lines format to a file with the .jsonl extension (a requirement of an external system in our case) and overwrite the file on each execution.
As I understand it, only the .jl and .jsonlines extensions are supported at the moment; .jsonl was discussed in #4848 but is not supported yet.
So in this case I tried to use the -O argument together with --output-format for the scrapy crawl command.
Steps to Reproduce
scrapy crawl -O <filename>.jsonl --output-format jl <spider_name>
OR
scrapy crawl -O <filename>.jsonl --output-format jsonlines <spider_name>
Expected behavior: File is overwritten with the parsed content.
Actual behavior: Parsed content is appended to the end of the existing file.
Reproduces how often: 100%
Versions
Scrapy : 2.6.1
lxml : 4.9.0.0
libxml2 : 2.9.14
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 22.4.0
Python : 3.10.4 (main, Jun 4 2022, 14:29:37) [GCC 9.4.0]
pyOpenSSL : 22.0.0 (OpenSSL 3.0.3 3 May 2022)
cryptography : 37.0.2
Platform : Linux-5.13.0-44-generic-x86_64-with-glibc2.31
Additional context
If I use scrapy crawl -O <filename>.jl <spider_name>
or scrapy crawl -O <filename>.jsonlines <spider_name>
, the file is overwritten successfully, so it seems the case above should behave the same way.
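As an alternative to command-line flags, the feed can also be declared explicitly in the project's settings.py via the FEEDS setting (available since Scrapy 2.1, with the overwrite option since 2.4). This is a sketch; the output path is a placeholder:

```python
# settings.py — declare the feed explicitly so the format does not
# have to be inferred from the file extension at all.
FEEDS = {
    "output.jsonl": {
        "format": "jsonlines",  # serialize items as JSON Lines
        "overwrite": True,      # truncate the file on each run instead of appending
    },
}
```

With this in place, a plain `scrapy crawl <spider_name>` writes and overwrites output.jsonl without any -O or --output-format arguments.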
Issue Analytics
- Created a year ago
- Comments: 12 (6 by maintainers)
Top GitHub Comments
Thanks a lot for your feedback. Sorry, I didn’t notice this warning among the many log messages.
Your approach is working, thank you!
I tried to find where this colon syntax is mentioned in the documentation but didn’t find any information about it.
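For reference, the colon workaround referred to above appends the feed format to the output URI, so Scrapy uses it instead of inferring the format from the extension. A sketch, with the spider and file names as placeholders:

```shell
# "<uri>:<format>" tells Scrapy the feed format explicitly,
# so the unrecognized .jsonl extension no longer matters.
scrapy crawl <spider_name> -O <filename>.jsonl:jsonlines
```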
I have addressed this issue in PR #5605